Use a generative AI foundation model for summarization and question answering using your own data

Large language models (LLMs) can be used to analyze complex documents and provide summaries and answers to questions. The post Domain-adaptation Fine-tuning of Foundation Models in Amazon SageMaker JumpStart on Financial data describes how to fine-tune an LLM using your own dataset. Once you have a solid LLM, you’ll want to expose that LLM to business users to process new documents, which could be hundreds of pages long. In this post, we demonstrate how to construct a real-time user interface to let business users process a PDF document of arbitrary length. Once the file is processed, you can summarize the document or ask questions about the content. The sample solution described in this post is available on GitHub.

Working with financial documents

Financial statements like quarterly earnings reports and annual reports to shareholders are often tens or hundreds of pages long. These documents contain a lot of boilerplate language like disclaimers and legal language. If you want to extract the key data points from one of these documents, you need both time and some familiarity with the boilerplate language so you can identify the interesting facts. And of course, you can’t ask an LLM questions about a document it has never seen.

LLMs used for summarization have a limit on the number of tokens (the sub-word units a model reads) passed into the model, and with some exceptions, this limit is typically no more than a few thousand tokens. That normally rules out summarizing longer documents in a single call.

Our solution handles documents that exceed an LLM’s maximum token sequence length, and makes those documents available to the LLM for question answering.

Solution overview

Our design has three important pieces:

  • It has an interactive web application for business users to upload and process PDFs
  • It uses the langchain library to split a large PDF into more manageable chunks
  • It uses the retrieval augmented generation technique to let users ask questions about new data that the LLM hasn’t seen before

As shown in the following diagram, we use a front end implemented with React JavaScript hosted in an Amazon Simple Storage Service (Amazon S3) bucket fronted by Amazon CloudFront. The front-end application lets users upload PDF documents to Amazon S3. After the upload is complete, you can trigger a text extraction job powered by Amazon Textract. As part of the post-processing, an AWS Lambda function inserts special markers into the text indicating page boundaries. When that job is done, you can invoke an API that summarizes the text or answers questions about it.

Because some of these steps may take some time, the architecture uses a decoupled asynchronous approach. For example, the call to summarize a document invokes a Lambda function that posts a message to an Amazon Simple Queue Service (Amazon SQS) queue. Another Lambda function picks up that message and starts an Amazon Elastic Container Service (Amazon ECS) AWS Fargate task. The Fargate task calls the Amazon SageMaker inference endpoint. We use a Fargate task here because summarizing a very long PDF may take more time and memory than a Lambda function has available. When the summarization is done, the front-end application can pick up the results from an Amazon DynamoDB table.
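
The first hop of that asynchronous flow can be sketched as follows. This is a minimal illustration, not the sample solution's code: the message fields, the `job_id` scheme, and the function names are assumptions made for this example. The only AWS call it relies on is the SQS `send_message` operation, and a small in-memory stand-in replaces the real client so the sketch runs without credentials.

```python
import json
import uuid


def build_summarize_message(bucket: str, key: str, chunk_size: int, chunk_overlap: int) -> dict:
    """Build the job description that the API-facing function would enqueue.

    All field names here are hypothetical; the sample solution may use others.
    """
    return {
        "job_id": str(uuid.uuid4()),  # assumed identifier for looking up results later
        "bucket": bucket,
        "key": key,
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
    }


def enqueue_job(sqs_client, queue_url: str, message: dict) -> str:
    """Post the job to the queue and return its job_id.

    sqs_client is expected to behave like boto3.client("sqs").
    """
    sqs_client.send_message(QueueUrl=queue_url, MessageBody=json.dumps(message))
    return message["job_id"]


class FakeSqsClient:
    """In-memory stand-in so the sketch runs without AWS credentials."""

    def __init__(self):
        self.sent = []

    def send_message(self, QueueUrl, MessageBody):
        self.sent.append((QueueUrl, MessageBody))
        return {"MessageId": "local-test"}
```

Returning the job identifier immediately lets the caller respond to the front end right away, while the long-running work happens downstream of the queue.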

For summarization, we use AI21’s Summarize model, one of the foundation models available through Amazon SageMaker JumpStart. Because this model handles documents of up to 10,000 words (approximately 40 pages), we use langchain’s text splitter to make sure that each summarization call to the LLM is no more than 10,000 words long. For text generation, we use Cohere’s Medium model, and we use GPT-J for embeddings, both via JumpStart.

Summarization processing

When handling larger documents, we need to define how to split the document into smaller pieces. When we get the text extraction results back from Amazon Textract, we insert markers for larger chunks of text (a configurable number of pages), individual pages, and line breaks. Langchain will split based on those markers and assemble smaller documents that are under the token limit. See the following code:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split on the coarsest marker first, then fall back to finer ones.
text_splitter = RecursiveCharacterTextSplitter(
    separators=["<CHUNK>", "<PAGE>", "\n"],
    chunk_size=int(chunk_size),
    chunk_overlap=int(chunk_overlap),
)

with open(local_path) as f:
    doc = f.read()
texts = text_splitter.split_text(doc)
print(f"Number of splits: {len(texts)}")

llm = SageMakerLLM(endpoint_name=endpoint_name)

# Summarize each piece separately, then join the partial summaries.
responses = []
for t in texts:
    r = llm(t)
    responses.append(r)
summary = "\n".join(responses)
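
The marker insertion described above might look like the following sketch. It is an illustration only: the function name, the `pages_per_chunk` parameter, and the choice to let a chunk marker replace the page marker at chunk boundaries are assumptions, not the sample solution's actual post-processing.

```python
from typing import List

PAGE_MARKER = "<PAGE>"
CHUNK_MARKER = "<CHUNK>"


def add_markers(pages: List[str], pages_per_chunk: int) -> str:
    """Join page texts, marking page boundaries and larger chunk boundaries."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    parts = []
    for i, page in enumerate(pages):
        if i > 0:
            # Every pages_per_chunk pages, emit a chunk marker instead of a page marker.
            parts.append(CHUNK_MARKER if i % pages_per_chunk == 0 else PAGE_MARKER)
        parts.append(page)
    return "".join(parts)
```

With markers in place, a splitter that tries `<CHUNK>` first and `<PAGE>` second tends to cut at natural document boundaries before falling back to line breaks.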

The LLM in the summarization chain is a thin wrapper around our SageMaker endpoint:

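from typing import List, Optional

import ai21
from langchain.llms.base import LLM


class SageMakerLLM(LLM):
    """Thin langchain wrapper around the AI21 Summarize endpoint."""

    endpoint_name: str

    @property
    def _llm_type(self) -> str:
        return "summarize"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        # The prompt is the text to summarize; stop sequences are not used.
        response = ai21.Summarize.execute(
            source=prompt,
            sourceType="TEXT",
            sm_endpoint=self.endpoint_name,
        )
        return response.summary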

Question answering

In the retrieval augmented generation method, we first split the document into smaller segments. We create embeddings for each segment and store them in the open-source Chroma vector database via langchain’s interface. We save the database in an Amazon Elastic File System (Amazon EFS) file system for later use. See the following code:

documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500,
                                                chunk_overlap  = 0)
texts = text_splitter.split_documents(documents)
print(f"Number of splits: {len(texts)}")

embeddings = SMEndpointEmbeddings(
    endpoint_name=endpoint_name,
)
vectordb = Chroma.from_documents(texts, embeddings, 
    persist_directory=persist_directory)
vectordb.persist()

When the embeddings are ready, the user can ask a question. We search the vector database for the text chunks that most closely match the question:

embeddings = SMEndpointEmbeddings(
    endpoint_name=endpoint_embed
)
vectordb = Chroma(persist_directory=persist_directory,
                  embedding_function=embeddings)
docs = vectordb.similarity_search_with_score(question)

We take the closest matching chunk and use it as context for the text generation model to answer the question:

cohere_client = Client(endpoint_name=endpoint_qa)
# high_score_idx is the index of the best-matching chunk in docs.
context = docs[high_score_idx][0].page_content.replace("\n", "")
qa_prompt = f'Context={context}\nQuestion={question}\nAnswer='
response = cohere_client.generate(prompt=qa_prompt,
                                  max_tokens=512,
                                  temperature=0.25,
                                  return_likelihoods='GENERATION')
answer = response.generations[0].text.strip().replace('\n', '')
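
To make the retrieval step concrete without a vector database, here is a small self-contained sketch. It assumes cosine similarity as the matching metric and uses hand-written two-dimensional vectors in place of real embeddings; the function names are hypothetical, and the vector store in the sample may score matches differently.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_chunk(question_vec: Sequence[float],
                  chunks: List[Tuple[str, Sequence[float]]]) -> str:
    """Return the text of the chunk whose embedding is most similar to the question."""
    best_text, _ = max(chunks, key=lambda c: cosine_similarity(question_vec, c[1]))
    return best_text


def build_prompt(context: str, question: str) -> str:
    """Assemble a Context/Question/Answer prompt in the layout used above."""
    # Replacing newlines with spaces (a choice made for this sketch) keeps words apart.
    context = context.replace("\n", " ")
    return f"Context={context}\nQuestion={question}\nAnswer="
```

Only the single closest chunk is passed as context here, which keeps the prompt short but means an answer spread across several chunks may be missed.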

User experience

Although LLMs represent advanced data science, most of the use cases for LLMs ultimately involve interaction with non-technical users. Our example web application handles an interactive use case where business users can upload and process a new PDF document.

The following diagram shows the user interface. A user starts by uploading a PDF. After the document is stored in Amazon S3, the user is able to start the text extraction job. When that’s complete, the user can invoke the summarization task or ask questions. The user interface exposes some advanced options like the chunk size and chunk overlap, which would be useful for advanced users who are testing the application on new documents.

User interface

Next steps

LLMs provide significant new information retrieval capabilities. Business users need convenient access to those capabilities. There are two directions for future work to consider:

  • Take advantage of the powerful LLMs already available in JumpStart foundation models. With just a few lines of code, our sample application could deploy and make use of advanced LLMs from AI21 and Cohere for text summarization and generation.
  • Make these capabilities accessible to non-technical users. A prerequisite to processing PDF documents is extracting text from the document, and summarization jobs may take several minutes to run. That calls for a simple user interface with asynchronous backend processing capabilities, which is easy to design using cloud-native services like Lambda and Fargate.

We also note that a PDF document is semi-structured information. Important cues like section headings are difficult to identify programmatically, because they rely on font sizes and other visual indicators. Identifying the underlying structure of information helps the LLM process the data more accurately, at least until such time that LLMs can handle input of unbounded length.

Conclusion

In this post, we showed how to build an interactive web application that lets business users upload and process PDF documents for summarization and question answering. We saw how to take advantage of JumpStart foundation models to access advanced LLMs, and use text splitting and retrieval augmented generation techniques to process longer documents and make them available as information to the LLM.

There is no reason not to make these powerful capabilities available to your users now. We encourage you to start using the JumpStart foundation models today.


About the author

Randy DeFauw is a Senior Principal Solutions Architect at AWS. He holds an MSEE from the University of Michigan, where he worked on computer vision for autonomous vehicles. He also holds an MBA from Colorado State University. Randy has held a variety of positions in the technology space, ranging from software engineering to product management. He entered the Big Data space in 2013 and continues to explore that area. He is actively working on projects in the ML space and has presented at numerous conferences including Strata and GlueCon.

Integrate Amazon SageMaker Model Cards with the model registry

Amazon SageMaker Model Cards enable you to standardize how models are documented, thereby achieving visibility into the lifecycle of a model, from designing, building, training, and evaluation. Model cards are intended to be a single source of truth for business and technical metadata about the model that can reliably be used for auditing and documentation purposes. They provide a factsheet of the model that is important for model governance.

Until now, model cards were logically associated with a model in the Amazon SageMaker Model Registry by matching on the model name. However, as customers iterate on a business problem with a machine learning (ML) model, they create multiple versions of the model, and they need to operationalize and govern each of those versions. Therefore, they need the ability to associate a model card with a particular model version.

In this post, we discuss a new feature that supports integrating model cards with the model registry at the deployed model version level. We discuss the solution architecture and best practices for managing model card versions, and walk through how to set up, operationalize, and govern the model card integration with the model version in the model registry.

Solution overview

SageMaker model cards help you standardize documenting your models from a governance perspective, and the SageMaker model registry helps you deploy and operationalize ML models. The model registry supports a hierarchical structure for organizing and storing ML models with model metadata information.

When an organization solves a business problem using ML, such as a customer churn prediction, we recommend the following steps:

  1. Create a model card for the business problem to be solved.
  2. Create a model package group for the business problem to be solved.
  3. Build, train, evaluate, and register the first model package version (for example, Customer Churn V1).
  4. Update the model card, linking the model package version to the model card.
  5. As you iterate on a new model package version, clone the model card from the previous version and link it to the new model package version (for example, Customer Churn V2).

The following figure illustrates how a SageMaker model card integrates with the model registry.

As illustrated in the preceding diagram, the integration of SageMaker model cards and the model registry allows you to associate a model card with a specific model version in the model registry. This enables you to establish a single source of truth for your registered model versions, with comprehensive and standardized documentation across all stages of the model’s journey on SageMaker, facilitating discoverability and promoting governance, compliance, and accountability throughout the model lifecycle.

Best practices for managing model cards

Operating in machine learning with governance is a critical requirement for many enterprise organizations today, notably in highly regulated industries. As part of those requirements, AWS provides several services that enable reliable operation of the ML environment.

SageMaker model cards document critical details about your ML models in a single place for streamlined governance and reporting. Model cards help you capture details such as the intended use and risk rating of a model, training details and metrics, evaluation results and observations, and additional call-outs such as considerations, recommendations, and custom information.

Model cards need to be managed and updated as part of your development process, throughout the ML lifecycle. They are an important part of continuous delivery and pipelines in ML. In the same way that a Well-Architected ML project implements continuous integration and continuous delivery (CI/CD) under the umbrella of MLOps, a continuous ML documentation process is a critical capability in a lot of regulated industries or for higher risk use cases. Model cards are part of the best practices for responsible and transparent ML development.

The following diagram shows how model cards should be part of a development lifecycle.

Consider the following best practices:

  • We recommend creating model cards early in your project lifecycle. In the first phase of the project, when you are working on identifying the business goal and framing the ML problem, you should initiate the creation of the model card. As you work through the different steps of business requirements and important performance metrics, you can create the model card in a draft status and determine the business details and intended uses.
  • As part of your model development lifecycle phase, you should use the model registry to catalog models for production, manage model versions, and associate metadata with a model. The model registry enables lineage tracking.
  • After you have iterated successfully and are ready to deploy your model to production, it’s time to update the model card. In the deployment lifecycle phase, you can update the model details of the model card. You should also update training details, evaluation details, ethical considerations, and caveats and recommendations.

Model cards have versions associated with them. A given model card version is immutable across all attributes other than the model card status. If you make any other changes to the model card, such as evaluation metrics, description, or intended uses, SageMaker creates a new version of the model card to reflect the updated information. This ensures that a model card version, once created, can’t be tampered with. Additionally, each unique model name can have only one associated model card, and that association can’t be changed after you create the model card.
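
The versioning rule can be illustrated with a toy data structure. This is not the SageMaker API: the class names, fields, and status strings below are made up for this sketch, which only models the behavior described above (status changes in place, any other change appends a new immutable version).

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass(frozen=True)
class CardVersion:
    """One immutable snapshot of a (toy) model card."""
    version: int
    status: str
    description: str


@dataclass
class ToyModelCard:
    name: str
    versions: List[CardVersion] = field(default_factory=list)

    @property
    def latest(self) -> CardVersion:
        return self.versions[-1]

    def set_status(self, status: str) -> None:
        # Status is the one attribute that changes without creating a new version.
        self.versions[-1] = replace(self.latest, status=status)

    def update_description(self, description: str) -> None:
        # Any other change appends a new version; earlier versions stay untouched.
        self.versions.append(
            CardVersion(self.latest.version + 1, self.latest.status, description)
        )
```

For example, moving a card from a draft status to a review status leaves the version count unchanged, while editing the description produces version 2 and keeps version 1 available for audit.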

ML models are dynamic, and workflow automation components enable you to scale your ability to build, train, test, and deploy hundreds of models in production, iterate faster, reduce errors due to manual orchestration, and build repeatable mechanisms.

Therefore, the lifecycle of your model cards will look as described in the following diagram. Every time you update your model card through the model lifecycle, you automatically create a new version of the model card. Every time you iterate on a new model version, you create a new model card that can inherit some model card information of the previous model versions and follow the same lifecycle.

Pre-requisites

This post assumes that you already have models in your model registry. If you want to follow along, you can use the following SageMaker example on GitHub to populate your model registry: SageMaker Pipelines integration with Model Monitor and Clarify.

Integrate a model card with the model version in the model registry

In this example, we have the model-monitor-clarify-group package in our model registry.

In this package, two model versions are available.

For this example, we link Version 1 of the model to a new model card. In the model registry, you can see the details for Version 1.

We can now use the new feature in the SageMaker Python SDK. Using the ModelPackage class from the sagemaker.model_card module, you can select the specific model version in the model registry that you want to link the model card to.

You can now create a new model card for the model version and specify the model_package_details parameter with the previous model package retrieved. You need to populate the model card with all the additional details necessary. For this post, we create a simple model card as an example.

You can then use that definition to create a model card using the SageMaker Python SDK.

When loading the model card again, you can see the associated model under "__model_package_details".

You also have the option to update an existing model card with the model_package as shown in the example code snippet below:

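from sagemaker.model_card import ModelCard, ModelPackage

# Load the existing card, attach the model package version, and save.
my_card = ModelCard.load("<model_card_name>")
mp_details = ModelPackage.from_model_package_arn("<arn>")
my_card.model_package_details = mp_details
my_card.update()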

Finally, when creating or updating a new model package version in an existing model package group, if a model card already exists in that model package group, some information such as the business details and intended uses can be carried over to the new model card.

Clean up

Users are responsible for cleaning up resources if created using the notebook mentioned in the pre-requisites section. Please follow the instructions in the notebook to clean up resources.

Conclusion

In this post, we discussed how to integrate a SageMaker model card with a model version in the model registry. We shared the solution architecture with best practices for implementing a model card and showed how to set up and operationalize a model card to improve your model governance posture. We encourage you to try out this solution and share your feedback in the comments section.


About the Authors

Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 20 years of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his 2-year-old sheep-a-doodle!

Natacha Fort is the Government Data Science Lead for Public Sector Australia and New Zealand, Principal SA at AWS. She helps organizations navigate their machine learning journey, supporting them from framing the machine learning problem to deploying into production, all the while making sure the best architecture practices are in place to ensure their success. Natacha focuses with organizations on MLOps and responsible AI.

Research Focus: Week of July 17, 2023

Microsoft Research Focus 20 | Week of July 17, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESEARCH

RetroRanker: leveraging reaction changes to improve retrosynthesis prediction through re-ranking

Retrosynthesis is an important task in organic chemistry. It’s designed to propose a list of candidate reactants that are likely to lead to a given product. Recent data-driven approaches to retrosynthesis have achieved promising results. However, they might make predictions based on the training data distribution, a phenomenon known as frequency bias, which can generate lower quality predictions.

In a new paper: RetroRanker: leveraging reaction changes to improve retrosynthesis prediction through re-ranking, researchers from Microsoft and academic colleagues introduce RetroRanker, a ranking model built upon graph neural networks, which is designed to mitigate frequency bias in predictions of existing retrosynthesis models. In order to lower the rankings of chemically unreasonable predictions, RetroRanker incorporates potential reaction changes of each set of predicted reactants in obtaining the given product. The predicted re-ranked results on publicly available retrosynthesis benchmarks show that RetroRanker can improve results on most state-of-the-art models. Preliminary studies also indicate that RetroRanker can enhance the performance of multi-step retrosynthesis.

NEW RESEARCH

Fine-Tuning Language Models with Advantage-Induced Policy Alignment

Reinforcement learning from human feedback (RLHF) has emerged as a reliable approach to aligning large language models to human preferences. Among the plethora of RLHF techniques, proximal policy optimization (PPO) is one of the most widely used. Yet despite its popularity, PPO may suffer from mode collapse, instability, and poor sample efficiency.

In a new paper: Fine-Tuning Language Models with Advantage-Induced Policy Alignment, researchers from Microsoft show that these issues can be alleviated by a novel algorithm called Advantage-Induced Policy Alignment (APA), which leverages a squared error loss function based on the estimated advantages. This research demonstrates empirically that APA consistently outperforms PPO in language tasks by a large margin, when a separate reward model is employed as the evaluator. In addition, compared with PPO, APA offers a more stable form of control over the deviation from the model’s initial policy, ensuring that the model improves its performance without collapsing to deterministic output. In addition to empirical results, the researchers also provide a theoretical justification supporting the design of their loss function.


NEW RESEARCH

A project-driven distributed energy resource dataset for the U.S. grid

Designing future energy systems to accommodate variable renewable energy and third-party owned devices requires information with high spatial and temporal granularity. Existing public datasets focus on specific resource classes (for example, bulk generators, residential solar, or electric vehicles), and are not useful for informing holistic planning or policy decisions. Further, with the growing presence of distributed energy resources (DERs) located in the distribution grid, datasets and models that focus only on the bulk system will no longer be sufficient.

In a new paper: Towards closing the data gap: A project-driven distributed energy resource dataset for the U.S. Grid, researchers from Microsoft address this modelling need with a project-driven dataset of DERs for the contiguous U.S., generated using only publicly available data. They integrate the resources into a high-resolution test system of the U.S. grid. This model, and the DER dataset, enable planners, operators, and policy makers to pose questions and conduct data-driven analysis of rapid decarbonization pathways for the electricity system. They further pose a set of research questions in their research project database.


NEW RESEARCH

End-to-end Privacy Preserving Training and Inference for Air Pollution Forecasting with Data from Rival Fleets

Privacy-preserving machine learning promises to train machine learning models by combining data spread across multiple data silos. Theoretically, secure multiparty computation (MPC) allows multiple data owners to train models on their joint data without revealing data to each other. However, prior implementations have had limitations affecting accuracy, breadth of supported models, and latency overheads that impact their relevance.

In a new paper: End-to-end Privacy Preserving Training and Inference for Air Pollution Forecasting with Data from Rival Fleets, researchers from Microsoft address the practical problem of secure training and inference of models for urban sensing problems. This includes traffic congestion estimation and air pollution monitoring in large cities, where data can be contributed by rival fleet companies while balancing the latency-accuracy trade-offs using MPC-based techniques.

This work includes a custom ML model that can be efficiently trained with MPC within a desirable latency, and an end-to-end system of private training and inference that provably matches the training accuracy of cleartext ML training. This trained model allows users to make sensitive queries in a privacy-preserving manner while carefully handling potentially invalid queries.


NEW RESEARCH

ASL Citizen – A Community-Sourced Dataset for Advancing Isolated Sign Language Recognition

About 70 million deaf people worldwide use a sign language as their primary language, and at least 71 countries mandate the provision of services in sign language. Nonetheless, most existing information resources (like search engines or news sites) are written, and do not offer equitable access. Intelligent sign language systems could help expand access, but development has been impeded by a severe lack of appropriate data.

To help advance the state of sign language modeling, a team at Microsoft collaborated with colleagues at multiple institutions to create ASL Citizen, the first crowdsourced isolated sign language dataset. It contains about 84,000 videos of 2,700 distinct signs from American Sign Language (ASL), making it the largest isolated sign language recognition (ISLR) dataset available. Unlike prior datasets, it features everyday signers in everyday recording scenarios, and was collected with Deaf community involvement, consent, and compensation. The dataset improves state-of-the-art performance in single-sign recognition from about 30% accuracy to 63% accuracy, over a large vocabulary and tested on participants unseen in training.

This dataset is released alongside a new paper: ASL Citizen: A Community-Sourced Dataset for Advancing Isolated Sign Language Recognition, which reframes ISLR as a dictionary retrieval task and establishes state-of-the-art baselines. Code and a searchable dictionary view of the crowdsourced dataset are also provided.


NEW RESOURCE

MABIM: Multi-agent Benchmark for Inventory Management

Multi-agent reinforcement learning (MARL) empowers multiple agents to accomplish shared objectives through collaboration and competition in specific environments. This approach has applications in diverse fields such as robotics, autonomous driving, gaming, economics, finance, and healthcare. The success of reinforcement learning algorithms depends on a variety of interactive learning environments. These environments enable agents to optimize decision-making strategies across numerous complex scenarios. Despite the emergence of various learning environments in the MARL domain, there remains a shortage of environments that address multiple challenges while offering flexible customization and expansion.

To tackle various MARL challenges, researchers from Microsoft recently released a versatile learning environment: Multi-agent Benchmark for Inventory Management (MABIM). Based on inventory management problems found in operations research, MABIM establishes a MARL benchmark evaluation framework that supports multi-echelon, multi-product inventory networks. This framework allows for the customization of diverse environments, simulating an array of challenging scenarios.

MABIM comprises 51 challenging tasks and includes features such as high operational efficiency, a Gym standard interface, comprehensive strategy visualization tools, and real-data-based capabilities to facilitate MARL research. Initial experiments using MABIM have revealed intriguing findings. For example, as the number of agents increases, the Independent Proximal Policy Optimization (IPPO) algorithm experiences difficulty training and the QTRAN algorithm becomes unstable. IPPO displays short-sighted behavior in resource-limited competitive environments, adopting long-term unprofitable strategies to evade immediate losses. Pure MARL algorithms have difficulty learning effective upstream and downstream strategies in environments that necessitate cooperation. In non-stationary environments, MARL strategies outperform conventional operations research algorithms.


NEW RESEARCH

NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation

Recently, visual synthesis has attracted a great deal of interest in the field of generative models. Existing work has demonstrated the ability to generate high-quality images. However, videos in real applications are more challenging than images due to their length. A feature film typically runs more than 90 minutes. Cartoons often run for 30 minutes. Even for short video applications like TikTok, the recommended length is 21 to 34 seconds.

In a recent paper: NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation, researchers from Microsoft propose a novel architecture for extremely long video generation. Most current work generates long videos sequentially, segment by segment, which normally leads to a gap between training on short videos and inferring long videos, and the sequential generation is inefficient. Instead, this new approach adopts a coarse-to-fine process, in which the video can be generated in parallel at the same granularity. A global diffusion model is applied to generate the keyframes across the entire time range, and then local diffusion models recursively fill in the content between nearby frames. This simple yet effective strategy allows direct training on long videos to reduce the training-inference gap, and makes it possible to generate all segments in parallel.

The post Research Focus: Week of July 17, 2023 appeared first on Microsoft Research.

Sailing Seas of Data: Startup Charts Autonomous Oceanic Monitoring

Saildrone is making a splash in autonomous oceanic monitoring.

The startup’s nautical data collection technology has tracked hurricanes up close in the North Atlantic, discovered a 3,200-foot underwater mountain in the Pacific Ocean and begun to help map the entirety of the world’s ocean floor.

Based in the San Francisco Bay Area, the company develops autonomous uncrewed surface vehicles (USVs) that carry a wide range of sensors. Its data streams are processed on NVIDIA Jetson modules for AI at the edge and are being optimized in prototypes with the NVIDIA DeepStream software development kit for intelligent video analytics.

Saildrone is seeking to make ocean intelligence collection cost-effective, offering data-gathering systems for science, fisheries, weather forecasting, ocean mapping and maritime security.

It has three different USVs, and its Mission Portal control center service is used for monitoring customized missions and visualizing data in near real time. Also, some of Saildrone’s historical data is freely available to the public.

“We’ve sailed into three major hurricanes, and right through the eye of Hurricane Sam, and all the vehicles came out the other side — they are pretty robust platforms,” said Blythe Towal, vice president of software engineering at Saildrone, referring to a powerful cyclone that threatened Bermuda in 2021.

Saildrone, founded in 2012, has raised $190 million in funding. The startup is a member of NVIDIA Inception, a program that provides companies with technology support and AI platforms guidance.

Keeping an AI on Earth’s Waters

Saildrone is riding a wave of interest for use of its crewless data collection missions in environmental studies of oceans and lakes.

The University of Hawaii at Manoa has enlisted the help of three 23-foot Saildrone Explorer USVs to study the impact of ocean acidification on climate change. The six-month mission around the islands of Hawaii, Maui, Oahu and Kauai will be used to help evaluate the ocean’s health around the state.

Ocean acidification is a reduction in the ocean’s pH, and contributing factors include the burning of fossil fuels and farming. It can harm coral, oysters, clams, sea urchins and calcareous plankton, which in turn can threaten marine ecosystems.

Saildrone recently partnered with Seabed 2030 to completely map the world’s oceans. Seabed 2030 is a collaboration between the Nippon Foundation and the General Bathymetric Chart of the Oceans, or GEBCO, to map ocean floors worldwide by 2030.

“Saildrone’s vision is of a healthy ocean and a sustainable planet,” said Saildrone founder and CEO Richard Jenkins. “A complete map of the ocean floor is fundamental to achieving that vision.”

The scientific community worldwide is embracing NVIDIA AI for climate studies, including for hyper-local climate modeling, AI to improve sequestering carbon, renewable energy research and many other areas. Dedicating its own expertise, NVIDIA is developing the world’s most powerful AI supercomputer for predicting climate change, named Earth-2, which will be used to create a digital twin of Earth in Omniverse.

Energy-Efficient Data Processing 

Saildrone USVs enable researchers to collect more data using fewer resources than traditional boats and crews, conserving energy and keeping crews out of danger.

The USVs are built for harsh weather and long missions. One of its USVs recently completed a 370-day voyage monitoring carbon dioxide, sailing from Rhode Island across the North Atlantic to Cabo Verde, down to the equator off the west coast of Africa, and back to Florida.

Running mostly on solar and wind power requires energy-efficient computing to handle so much data processing.

“With solar power, being able to keep our compute load power efficiency lower than a typical computing platform running GPUs by implementing NVIDIA Jetson is important for enabling us to do these kinds of missions,” said Towal.

Oceanic Surveying Meets Edge AI

Saildrone relies on the NVIDIA JetPack SDK for access to a full development environment for hardware-accelerated edge AI on the Jetson platform. It runs machine learning on the module for image-based vessel detection to aid navigation.

Saildrone pilots set waypoints and optimize the routes using metocean data — which includes meteorological and oceanographic information — returned from the vehicle. All of the USVs are monitored around the clock, and operators can change course remotely via the cloud if needed.

Machine learning runs mostly locally on the Jetson module, though it can also run in the cloud over a satellite connection. Local processing matters because bandwidth is limited and it is costly to shuttle the high-resolution imagery produced by the vehicle’s robust suite of sensors.

The USVs have oceanographic sensors for measurement of wind, temperature, salinity and dissolved carbon. The company also enables research of ocean and lake floors with bathymetric sensors, including deep sonar mapping with single- or multi-beam for going deeper or wider. And its perceptual sensor suite includes radar and visual underwater acoustic sensors.

DeepStream Goes Deep Sea

Saildrone taps into the NVIDIA DeepStream SDK for its vision AI applications and services. Developers can build seamless streaming pipelines for AI-based video, audio and image analytics using the kit.

Offering a 10x throughput improvement, DeepStream can be applied from edge to cloud to develop optimized intelligent video applications that handle multiple video, image and audio streams.

Saildrone will rely on DeepStream for image preprocessing and model inference, which enables machine learning at the edge, even at sea while powered by sun and wind.

Learn more about NVIDIA Jetson modules and the DeepStream SDK.


Enhance Amazon Lex with conversational FAQ features using LLMs


Amazon Lex is a service that allows you to quickly and easily build conversational bots (“chatbots”), virtual agents, and interactive voice response (IVR) systems for applications such as Amazon Connect.

Artificial intelligence (AI) and machine learning (ML) have been a focus for Amazon for over 20 years, and many of the capabilities that customers use with Amazon are driven by ML. Today, large language models (LLMs) are transforming the way developers and enterprises solve historically complex challenges related to natural language understanding (NLU). We recently announced Amazon Bedrock, which democratizes foundation model access so developers can easily build and scale generative AI-based applications using familiar AWS tools and capabilities. One of the challenges enterprises face is incorporating their business knowledge into LLMs to deliver accurate and relevant responses. When leveraged effectively, enterprise knowledge bases can be used to deliver tailored self-service and assisted-service experiences, by delivering information that helps customers solve problems independently, augments an agent’s knowledge, or both. Today, a bot developer can improve self-service experiences without using LLMs in a couple of ways. First, they can create intents, sample utterances, and responses that cover all anticipated user questions within an Amazon Lex bot. Second, they can integrate bots with search solutions, which can index documents stored across a wide range of repositories and find the most relevant document to answer their customer’s question. These methods are effective, but they require developer resources, which makes getting started difficult.

One of the benefits offered by LLMs is the ability to create relevant and compelling conversational self-service experiences. They do so by leveraging enterprise knowledge bases and delivering more accurate and contextual responses. This blog post introduces a powerful solution for augmenting Amazon Lex with LLM-based FAQ features using Retrieval Augmented Generation (RAG). We review how the RAG approach augments Amazon Lex FAQ responses using your company data sources. We also demonstrate Amazon Lex integration with LlamaIndex, which is an open-source data framework that provides knowledge source and format flexibility to the bot developer. As a bot developer gains confidence in using LlamaIndex to explore LLM integration, they can scale the Amazon Lex capability further. They can also use enterprise search services such as Amazon Kendra, which is natively integrated with Amazon Lex.

In this solution, we showcase the practical application of an Amazon Lex chatbot with LLM-based RAG enhancement. We use the Zappos customer support use case as an example to demonstrate the effectiveness of this solution, which takes the user through an enhanced FAQ experience (with LLM), rather than directing them to fallback (default, without LLM).

Solution overview

RAG combines the strengths of traditional retrieval-based and generative AI based approaches to Q&A systems. This methodology harnesses the power of large language models, such as Amazon Titan or open-source models (for example, Falcon), to perform generative tasks in retrieval systems. It also takes into account the semantic context from stored documents more effectively and efficiently.

RAG starts with an initial retrieval step to retrieve relevant documents from a collection based on the user’s query. It then employs a language model to generate a response by considering both the retrieved documents and the original query. By integrating RAG into Amazon Lex, we can provide accurate and comprehensive answers to user queries, resulting in a more engaging and satisfying user experience.

The RAG approach requires document ingestion so that embeddings can be created to enable LLM-based search. The following diagram shows how the ingestion process creates the embeddings that are then used by the chatbot during fallback to answer the customer’s question.

With this solution architecture, you should choose the most suitable LLM for your use case. It also provides an inference endpoint choice between Amazon Bedrock (in limited preview) and models hosted on Amazon SageMaker JumpStart, offering additional LLM flexibility.

The document is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket. The S3 bucket has an event listener attached that invokes an AWS Lambda function on changes to the bucket. The event listener ingests the new document and places the embeddings in another S3 bucket. The embeddings are then used by the RAG implementation in the Amazon Lex bot during the fallback intent to answer the customer’s question. The next diagram shows the architecture of how an FAQ bot within Lex can be enhanced with LLMs and RAG.
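
As a rough illustration of the S3-triggered ingestion step described above, the following sketch shows how a Lambda function can read the bucket and object key out of an S3 event notification. The function names and the return value are hypothetical, and the embedding and index-writing logic is intentionally left as a comment; the code in the GitHub repository may be organized differently.

```python
from typing import Dict, List, Tuple
from urllib.parse import unquote_plus


def extract_uploaded_objects(event: Dict) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs for every record in an S3 event notification."""
    uploaded = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (for example, spaces become "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        uploaded.append((bucket, key))
    return uploaded


def lambda_handler(event, context):
    """Hypothetical ingestion entry point: report which objects need indexing."""
    objects = extract_uploaded_objects(event)
    # A real handler would download each object here, create the embeddings,
    # and write the resulting index to the destination S3 bucket.
    return {"objects_to_index": [f"s3://{bucket}/{key}" for bucket, key in objects]}
```

Keeping the event parsing in its own function makes it possible to test the handler locally with a hand-written event before any AWS resources exist.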

Let’s explore how we can integrate RAG based on LlamaIndex into an Amazon Lex bot. We provide code examples and an AWS Cloud Development Kit (AWS CDK) import to assist you in setting up the integration. You can find the code examples in our GitHub repository. The following sections provide a step-by-step guide to help you set up the environment and deploy the necessary resources.

How RAG works with Amazon Lex

The flow of RAG involves an iterative process where the retriever component retrieves relevant passages, the question and passages help construct the prompt, and the generation component produces a response. This combination of retrieval and generation techniques allows the RAG model to take advantage of the strengths of both approaches, providing accurate and contextually appropriate answers to user questions. The workflow provides the following capabilities:

  • Retriever engine – The RAG model begins with a retriever component responsible for retrieving relevant documents from a large corpus. This component typically uses an information retrieval technique like TF-IDF or BM25 to rank and select documents that are likely to contain the answer to a given question. The retriever scans the document corpus and retrieves a set of relevant passages.
  • Prompt helper – After the retriever has identified the relevant passages, the RAG model moves to prompt creation. The prompt is a combination of the question and the retrieved passages, serving as additional context for the prompt, which is used as input to the generator component. To create the prompt, the model typically augments the question with the selected passages in a specific format.
  • Response generation – The prompt, consisting of the question and relevant passages, is fed into the generation component of the RAG model. The generation component is usually a language model capable of reasoning through the prompt to generate a coherent and relevant response.
  • Final response – Finally, the RAG model selects the highest-ranked answer as the output and presents it as the response to the original question. The selected answer can be further postprocessed or formatted as necessary before being returned to the user. In addition, the solution can filter out the generated response if the retrieval results yield a low confidence score, which suggests that the question is likely out of distribution (OOD).
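
The four stages above can be sketched end to end in plain Python. This is a toy illustration rather than the solution's implementation: it uses a small TF-IDF retriever as a stand-in for the embedding-based retrieval that LlamaIndex performs, and the `generate` argument is a placeholder for a call to whatever LLM endpoint you host. All names, the prompt wording, and the default cutoff are assumptions.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple

OUT_OF_SCOPE = "Sorry, I can't answer that from the FAQ."


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class TfidfRetriever:
    """Retriever engine: rank passages by TF-IDF cosine similarity."""

    def __init__(self, passages: List[str]):
        self.passages = passages
        doc_tokens = [tokenize(p) for p in passages]
        doc_freq = Counter(t for tokens in doc_tokens for t in set(tokens))
        n = len(passages)
        # Smoothed inverse document frequency keeps every weight positive.
        self.idf = {t: math.log((1 + n) / (1 + df)) + 1 for t, df in doc_freq.items()}
        self.doc_vectors = [self._vectorize(tokens) for tokens in doc_tokens]

    def _vectorize(self, tokens: List[str]) -> Dict[str, float]:
        counts = Counter(t for t in tokens if t in self.idf)
        return {t: c * self.idf[t] for t, c in counts.items()}

    @staticmethod
    def _cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
        return dot / norm if norm else 0.0

    def retrieve(self, question: str, top_k: int = 2) -> List[Tuple[float, str]]:
        query = self._vectorize(tokenize(question))
        scored = [(self._cosine(query, d), p) for d, p in zip(self.doc_vectors, self.passages)]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


def build_prompt(question: str, passages: List[str]) -> str:
    """Prompt helper: combine the question with the retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context.\nContext:\n{context}\nQuestion: {question}\nAnswer:"


def answer(question: str, retriever: TfidfRetriever,
           generate: Callable[[str], str], cutoff: float = 0.1) -> str:
    """Response generation plus the low-confidence filter on the final response."""
    hits = retriever.retrieve(question)
    if not hits or hits[0][0] < cutoff:
        return OUT_OF_SCOPE  # retrieval confidence too low; treat as out of scope
    prompt = build_prompt(question, [p for score, p in hits if score >= cutoff])
    return generate(prompt)


PASSAGES = [
    "Returns are free within 365 days of purchase.",
    "Gift cards never expire and can be used on any order.",
    "Standard shipping takes four to five business days.",
]
```

Calling `answer("Do gift cards expire?", TfidfRetriever(PASSAGES), generate=my_llm)` passes a prompt containing only the gift card passage to `my_llm`, while a question that matches nothing in the passages returns the out-of-scope reply without calling the model.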

LlamaIndex: An open-source data framework for LLM-based applications

In this post, we demonstrate the RAG solution based on LlamaIndex. LlamaIndex is an open-source data framework specifically designed to facilitate LLM-based applications. It offers a robust and scalable solution for managing document collection in different formats. With LlamaIndex, bot developers are empowered to effortlessly integrate LLM-based QA (question answering) capabilities into their applications, eliminating the complexities associated with managing solutions catered to large-scale document collections. Furthermore, this approach proves to be cost-effective for smaller-sized document repositories.

Prerequisites

You should have the following prerequisites:

Set up your development environment

The main third-party package requirements are llama_index and sagemaker sdk. Follow the specified commands in our GitHub repository’s README to set up your environment properly.

Deploy the required resources

This step involves creating an Amazon Lex bot, S3 buckets, and a SageMaker endpoint. Additionally, you need to Dockerize the code in the Docker image directory and push the images to Amazon Elastic Container Registry (Amazon ECR) so that it can run in Lambda. Follow the specified commands in our GitHub repository’s README to deploy the services.

During this step, we demonstrate LLM hosting via SageMaker Deep Learning Containers. Adjust the settings according to your computation needs:

  • Model – To find a model that meets your requirements, you can explore resources like the Hugging Face model hub. It offers a variety of models such as Falcon 7B or Flan-T5-XXL. Additionally, you can find detailed information about various officially supported model architectures, helping you make an informed decision. For more information about different model types, refer to optimized architectures.
  • Model inference endpoint – Define the path of the model (for example, Falcon 7B), choose your instance type (for example, g5.4xlarge), and use quantization (for example, int-8 quantization). Note: This solution provides you the flexibility to choose another model inferencing endpoint. You can also use Amazon Bedrock, which provides access to other LLMs such as Amazon Titan.
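
One possible way to express these settings in code is shown below. It is a sketch under assumptions, not the configuration from the GitHub repository: it assumes the Hugging Face LLM (Text Generation Inference) container, the `tiiuae/falcon-7b-instruct` model ID from the Hugging Face Hub, and a caller-supplied IAM role ARN. The helper names `build_endpoint_config` and `deploy` are hypothetical.

```python
from typing import Dict


def build_endpoint_config(model_id: str = "tiiuae/falcon-7b-instruct",
                          instance_type: str = "ml.g5.4xlarge",
                          quantize_int8: bool = True) -> Dict:
    """Collect the model, instance, and quantization choices in one place."""
    env = {
        "HF_MODEL_ID": model_id,  # model to pull from the Hugging Face Hub
        "SM_NUM_GPUS": "1",       # a g5.4xlarge instance has a single GPU
    }
    if quantize_int8:
        # In the Hugging Face LLM container, bitsandbytes selects 8-bit quantization.
        env["HF_MODEL_QUANTIZE"] = "bitsandbytes"
    return {"env": env, "instance_type": instance_type, "initial_instance_count": 1}


def deploy(config: Dict, role_arn: str):
    """Create the endpoint. Requires the SageMaker Python SDK and AWS credentials."""
    # Imported lazily so the config helper can be used without the SDK installed.
    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

    model = HuggingFaceModel(
        role=role_arn,
        image_uri=get_huggingface_llm_image_uri("huggingface"),
        env=config["env"],
    )
    return model.deploy(
        initial_instance_count=config["initial_instance_count"],
        instance_type=config["instance_type"],
    )
```

Separating the configuration from the deployment call makes it straightforward to swap the model or instance type, or to point the bot at a different inference endpoint altogether.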

Set up your document index via LlamaIndex

To set up your document index, first upload your document data. We assume that you have the source of your FAQ content, such as a PDF or text file.

After the document data is uploaded, the LlamaIndex system will automatically initiate the process of creating the document index. This task is performed by a Lambda function, which generates the index and saves it to an S3 bucket.

To enable efficient retrieval of relevant information, configure the document retriever using the LlamaIndex Retriever Query Engine. This engine offers several customization options, such as the following:

  • Embedding models – You can choose your embedding model, such as Hugging Face embedding.
  • Confidence cutoff – Specify a confidence cutoff threshold to determine the quality of retrieval results. If the confidence score falls below this threshold, you can choose to provide out-of-scope responses, indicating that the query is beyond the scope of the indexed documents.

Test the integration

Define your bot definition with a fallback intent and use the Amazon Lex console to test your FAQ requests. For more details, please refer to GitHub repository. The following screenshot shows an example conversation with the bot.
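
Before testing in the console, it can help to exercise the fallback logic locally. The sketch below assumes the Amazon Lex V2 Lambda event format and uses a stub in place of the real query engine. The function name `handle_fallback` and the choice to mark the intent as `Fulfilled` are assumptions for illustration, not the repository's code.

```python
from typing import Callable, Dict


def handle_fallback(event: Dict, answer_question: Callable[[str], str]) -> Dict:
    """Answer the user's utterance when Lex routes the request to the fallback intent."""
    intent = dict(event["sessionState"]["intent"])
    if intent["name"].lower() != "fallbackintent":
        # Any other intent is handed back to Lex unchanged.
        return {"sessionState": {"dialogAction": {"type": "Delegate"}, "intent": intent}}

    intent["state"] = "Fulfilled"
    reply = answer_question(event["inputTranscript"]).strip()
    return {
        "sessionState": {"dialogAction": {"type": "Close"}, "intent": intent},
        "messages": [{"contentType": "PlainText", "content": reply}],
    }


# A hand-written event with only the fields this sketch reads.
SAMPLE_EVENT = {
    "inputTranscript": "Do you have gift cards?",
    "sessionState": {"intent": {"name": "FallbackIntent", "state": "InProgress"}},
}
```

Calling `handle_fallback(SAMPLE_EVENT, lambda question: "Yes, we do.")` returns a `Close` response whose single message carries the stubbed answer, which is the shape the bot returns to the user.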

Tips to boost your bot efficiency

The following tips could potentially further improve the efficiency of your bot:

  • Index storage – Store your index in an S3 bucket or a service with vector database capabilities such as Amazon OpenSearch. By utilizing cloud-based storage solutions, you can enhance the accessibility and scalability of your index, leading to faster retrieval times and improved overall performance. Also, refer to this blog post for an Amazon Lex bot that utilizes an Amazon Kendra search solution.
  • Retrieval optimization – Experiment with different sizes of embedding models for the retriever. The choice of embedding model can significantly impact the input requirements of your LLM. Finding the optimal balance between model size and retrieval performance can result in improved efficiency and faster response times.
  • Prompt engineering – Experiment with different prompt formats, lengths, and styles to optimize the performance and quality of your bot’s answers.
  • LLM model selection – Select the most suitable LLM model for your specific use case. Consider factors such as model size, language capabilities, and compatibility with your application requirements. Choosing the right LLM model ensures optimal performance and efficient utilization of system resources.

Contact center conversations can span from self-service to a live human interaction. For use cases involving human-to-human interactions over Amazon Connect, you can use Wisdom to search and find content across multiple repositories, such as frequently asked questions (FAQs), wikis, articles, and step-by-step instructions for handling different customer issues.

Clean up

To avoid incurring future expenses, proceed with deleting all the resources that were deployed as part of this exercise. We have provided a script to shut down the SageMaker endpoint gracefully. Usage details are in the README. Additionally, to remove all the other resources you can run cdk destroy in the same directory as the other cdk commands to deprovision all the resources in your stack.

Summary

This post discussed the following steps to enhance Amazon Lex with LLM-based QA features using the RAG strategy and LlamaIndex:

  • Install the necessary dependencies, including LlamaIndex libraries
  • Set up model hosting via Amazon SageMaker or Amazon Bedrock (in limited preview)
  • Configure LlamaIndex by creating an index and populating it with relevant documents
  • Integrate RAG into Amazon Lex by modifying the configuration and configuring RAG to use LlamaIndex for document retrieval
  • Test the integration by engaging in conversations with the chatbot and observing its retrieval and generation of accurate responses

By following these steps, you can seamlessly incorporate powerful LLM-based QA capabilities and efficient document indexing into your Amazon Lex chatbot, resulting in more accurate, comprehensive, and contextually aware interactions with users. As a follow up, we also invite you to review our next blog post, which explores enhancing the Amazon Lex FAQ experience using URL ingestion and LLMs.


About the authors

Max Henkel-Wallace is a Software Development Engineer at AWS Lex. He enjoys leveraging technology to maximize customer success. Outside of work he is passionate about cooking, spending time with friends, and backpacking.

Song Feng is a Senior Applied Scientist at AWS AI Labs, specializing in Natural Language Processing and Artificial Intelligence. Her research explores various aspects of these fields including document-grounded dialogue modeling, reasoning for task-oriented dialogues, and interactive text generation using multimodal data.

Saket Saurabh is an engineer with the AWS Lex team. He works on improving the Lex developer experience to help developers build more human-like chatbots. Outside of work, he enjoys traveling, discovering diverse cuisines, and learning about different cultures.


Enhance Amazon Lex with LLMs and improve the FAQ experience using URL ingestion


In today’s digital world, most consumers would rather find answers to their customer service questions on their own than take the time to reach out to businesses or service providers. This blog post explores an innovative solution to build a question and answer chatbot in Amazon Lex that uses existing FAQs from your website. This AI-powered tool can provide quick, accurate responses to real-world inquiries, allowing the customer to quickly and easily solve common problems independently.

Single URL ingestion

Many enterprises have a published set of answers for FAQs for their customers available on their website. In this case, we want to offer customers a chatbot that can answer their questions from our published FAQs. In the blog post titled Enhance Amazon Lex with conversational FAQ features using LLMs, we demonstrated how you can use a combination of Amazon Lex and LlamaIndex to build a chatbot powered by your existing knowledge sources, such as PDF or Word documents. To support a simple FAQ, based on a website of FAQs, we need to create an ingestion process that can crawl the website and create embeddings that can be used by LlamaIndex to answer customer questions. In this case, we will build on the bot created in the previous blog post, which queries those embeddings with a user’s utterance and returns the answer from the website FAQs.

The following diagram shows how the ingestion process and the Amazon Lex bot work together for our solution.

In the solution workflow, the website with FAQs is ingested via AWS Lambda. This Lambda function crawls the website and stores the resulting text in an Amazon Simple Storage Service (Amazon S3) bucket. The S3 bucket then triggers a Lambda function that uses LlamaIndex to create embeddings that are stored in Amazon S3. When a question from an end-user arrives, such as “What is your return policy?”, the Amazon Lex bot uses its Lambda function to query the embeddings using a RAG-based approach with LlamaIndex. For more information about this approach and the prerequisites, refer to the blog post, Enhance Amazon Lex with conversational FAQ features using LLMs.

After the prerequisites from that blog post are complete, the first step is to ingest the FAQs into a document repository that can be vectorized and indexed by LlamaIndex. The following code shows how to accomplish this:

import logging
import sys
import requests
import html2text
from llama_index.readers.schema.base import Document
from llama_index import GPTVectorStoreIndex
from typing import List

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))


class EZWebLoader:

    def __init__(self, default_header: dict = None):
        # Keep a reference to the html2text module; html2text.html2text() converts HTML to text
        self._html_to_text_parser = html2text
        if default_header is None:
            self._default_header = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}
        else:
            self._default_header = default_header

    def load_data(self, urls: List[str], headers: dict = None) -> List[Document]:
        if headers is None:
            headers = self._default_header

        documents = []
        for url in urls:
            # Fetch the page and reduce the HTML to plain text before indexing
            response = requests.get(url, headers=headers).text
            response = self._html_to_text_parser.html2text(response)
            documents.append(Document(response))
        return documents

url = "http://www.zappos.com/general-questions"
loader = EZWebLoader()
documents = loader.load_data([url])
index = GPTVectorStoreIndex.from_documents(documents)

In the preceding example, we take a predefined FAQ website URL from Zappos and ingest it using the EZWebLoader class. With this class, we have navigated to the URL and loaded all the questions that are in the page into an index. We can now ask a question like “Does Zappos have gift cards?” and get the answers directly from our FAQs on the website. The following screenshot shows the Amazon Lex bot test console answering that question from the FAQs.

We were able to achieve this because we had crawled the URL in the first step and created embeddings that LlamaIndex could use to search for the answer to our question. Our bot’s Lambda function shows how this search is run whenever the fallback intent is returned:

import time
import json
import os
import logging
import boto3
from llama_index import StorageContext, load_index_from_storage


logger = logging.getLogger()
logger.setLevel(logging.DEBUG)


def download_docstore():
    # Create an S3 client
    s3 = boto3.client('s3')

    # List all objects in the S3 bucket and download each one
    try:
        bucket_name = 'faq-bot-storage-001'
        s3_response = s3.list_objects_v2(Bucket=bucket_name)

        if 'Contents' in s3_response:
            for item in s3_response['Contents']:
                file_name = item['Key']
                logger.debug("Downloading to /tmp/" + file_name)
                s3.download_file(bucket_name, file_name, '/tmp/' + file_name)

        logger.debug('All files downloaded from S3 and written to local filesystem.')

    except Exception as e:
        logger.error(e)
        raise e

#download the doc store locally
download_docstore()

storage_context = StorageContext.from_defaults(persist_dir="/tmp/")
# load index
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()


def lambda_handler(event, context):
    """
    Route the incoming request based on intent.
    The JSON body of the request is provided in the event slot.
    """
    # By default, treat the user request as coming from the America/New_York time zone.
    os.environ['TZ'] = 'America/New_York'
    time.tzset()
    logger.debug("===== START LEX FULFILLMENT ====")
    logger.debug(event)
    slots = {}
    if "currentIntent" in event and "slots" in event["currentIntent"]:
        slots = event["currentIntent"]["slots"]
    intent = event["sessionState"]["intent"]

    dialogaction = {"type": "Delegate"}
    message = []
    if str.lower(intent["name"]) == "fallbackintent":
        #execute query from the input given by the user
        response = str.strip(query_engine.query(event["inputTranscript"]).response)
        dialogaction["type"] = "Close"
        message.append({'content': f'{response}', 'contentType': 'PlainText'})

    final_response = {
        "sessionState": {
            "dialogAction": dialogaction,
            "intent": intent
        },
        "messages": message
    }

    logger.debug(json.dumps(final_response, indent=1))
    logger.debug("===== END LEX FULFILLMENT ====")

    return final_response

This solution works well when a single webpage has all the answers. However, most FAQ sites are not built on a single page. For instance, in our Zappos example, if we ask the question “Do you have a price matching policy?”, then we get a less-than-satisfactory answer, as shown in the following screenshot.

In the preceding interaction, the price-matching policy answer isn’t helpful for our user. This answer is short because the FAQ referenced is a link to a specific page about the price matching policy and our web crawl was only for the single page. Achieving better answers will mean crawling these links as well. The next section shows how to get answers to questions that require two or more levels of page depth.

N-level crawling

When we crawl a web page for FAQ knowledge, the information we want can be contained in linked pages. For example, in our Zappos example, we ask the question “Do you have a price matching policy?” and the answer is “Yes please visit <link> to learn more.” If someone asks “What is your price matching policy?” then we want to give a complete answer with the policy. To achieve this, we need to traverse links to get the actual information for our end-user. During the ingestion process, we can use our web loader to find the anchor links to other HTML pages and then traverse them. The following code change to our web crawler allows us to find links in the pages we crawl. It also includes some additional logic to avoid circular crawling and allow a filter by a prefix.

import logging
import requests
import html2text
from llama_index.readers.schema.base import Document
from llama_index import GPTVectorStoreIndex
from typing import List
import re


def find_http_urls_in_parentheses(s: str, prefix: str = None):
    # html2text renders links as [text](url), so capture URLs wrapped in literal parentheses
    pattern = r'\((https?://[^)]+)\)'
    urls = re.findall(pattern, s)

    matched = []
    if prefix is not None:
        for url in urls:
            if str(url).startswith(prefix):
                matched.append(url)
    else:
        matched = urls

    return list(set(matched)) # remove duplicates by converting to set, then convert back to list



class EZWebLoader:

    def __init__(self, default_header: dict = None):
        self._html_to_text_parser = html2text
        if default_header is None:
            self._default_header = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}
        else:
            self._default_header = default_header

    def load_data(self,
                  urls: List[str],
                  num_levels: int = 0,
                  level_prefix: str = None,
                  headers: dict = None) -> List[Document]:

        logging.info(f"Number of urls: {len(urls)}.")

        if headers is None:
            headers = self._default_header

        documents = []
        visited = set()
        for url in urls:
            q = [(url, 0)] #queue of (page, level) pairs; the starting URL is level 0
            for page, level in q:
                if page not in visited: #prevent cycles by checking to see if we already crawled a link
                    logging.info(f"Crawling {page}")
                    visited.add(page) #add entry to visited to prevent re-crawling pages
                    response = requests.get(page, headers=headers).text
                    response = self._html_to_text_parser.html2text(response) #reduce html to text
                    documents.append(Document(response))
                    if level < num_levels:
                        #crawl linked pages, but only until we are num_levels deep
                        ingest_urls = find_http_urls_in_parentheses(response, level_prefix)
                        logging.info(f"Found {len(ingest_urls)} pages to crawl.")
                        q.extend((ingest_url, level + 1) for ingest_url in ingest_urls)
                else:
                    logging.info(f"Skipping {page} as it has already been crawled")
        logging.info(f"Number of documents: {len(documents)}.")
        return documents

url = "http://www.zappos.com/general-questions"
loader = EZWebLoader()
#crawl the site with 1 level depth and prefix of "/c/" for customer service root
documents = loader.load_data([url],
                             num_levels=1, level_prefix="https://www.zappos.com/c/")
index = GPTVectorStoreIndex.from_documents(documents)

In the preceding code, we introduce the ability to crawl N levels deep, and we give a prefix that allows us to restrict crawling to only things that begin with a certain URL pattern. In our Zappos example, the customer service pages all are rooted from zappos.com/c, so we include that as a prefix to limit our crawls to a smaller and more relevant subset. The code shows how we can ingest up to two levels deep. Our bot’s Lambda logic remains the same because nothing has changed except the crawler ingests more documents.
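
To see how the depth limit, the prefix filter, and the cycle check interact without making network calls, the crawl can be run against a small in-memory site. This is a minimal sketch: the `example.com` pages, the `fetch` callable, and the function names are made up for illustration, and the link pattern assumes that links appear in Markdown form, `[text](url)`.

```python
import re
from typing import Callable, Dict, List, Optional


def find_links(markdown_text: str, prefix: Optional[str] = None) -> List[str]:
    """Return unique link targets in page order, optionally filtered by a URL prefix."""
    urls = re.findall(r"\((https?://[^)\s]+)\)", markdown_text)
    seen, ordered = set(), []
    for url in urls:
        if url not in seen and (prefix is None or url.startswith(prefix)):
            seen.add(url)
            ordered.append(url)
    return ordered


def crawl(start_url: str, fetch: Callable[[str], str],
          num_levels: int = 0, prefix: Optional[str] = None) -> Dict[str, int]:
    """Breadth-first crawl that returns each visited URL with the level it was found at."""
    levels = {start_url: 0}
    queue = [start_url]
    for page in queue:  # the queue grows while we iterate over it
        text = fetch(page)
        if levels[page] < num_levels:
            for link in find_links(text, prefix):
                if link not in levels:  # already queued or crawled, so skip (avoids cycles)
                    levels[link] = levels[page] + 1
                    queue.append(link)
    return levels


# A made-up site: the FAQ and returns pages link to each other (a cycle).
SITE = {
    "https://example.com/c/faq": "See [returns](https://example.com/c/returns) and [blog](https://example.com/blog).",
    "https://example.com/c/returns": "Back to [faq](https://example.com/c/faq) or [price match](https://example.com/c/price-match).",
    "https://example.com/c/price-match": "We match prices.",
    "https://example.com/blog": "News.",
}
```

With `num_levels=1` and the prefix `https://example.com/c/`, the crawl visits the FAQ page and the returns page only. Raising `num_levels` to 2 adds the price-match page, and the blog page is never visited because it falls outside the prefix.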

We now have all the documents indexed and we can ask a more detailed question. In the following screenshot, our bot provides the correct answer to the question “Do you have a price matching policy?”

We now have a complete answer to our question about price matching. Instead of simply being told “Yes see our policy,” it gives us the details from the second-level crawl.

Clean up

To avoid incurring future expenses, proceed with deleting all the resources that were deployed as part of this exercise. We have provided a script to shut down the SageMaker endpoint gracefully. Usage details are in the README. Additionally, to remove all the other resources you can run cdk destroy in the same directory as the other cdk commands to deprovision all the resources in your stack.

Conclusion

The ability to ingest a set of FAQs into a chatbot enables your customers to find the answers to their questions with straightforward, natural language queries. By combining the built-in support in Amazon Lex for fallback handling with a RAG solution such as LlamaIndex, we can provide a quick path for our customers to get satisfying, curated, and approved answers to FAQs. By applying N-level crawling in our solution, we can provide answers that span multiple FAQ links and give deeper answers to our customers’ queries. By following these steps, you can seamlessly incorporate powerful LLM-based Q&A capabilities and efficient URL ingestion into your Amazon Lex chatbot. This results in more accurate, comprehensive, and contextually aware interactions with users.


About the authors

Max Henkel-Wallace is a Software Development Engineer at AWS Lex. He enjoys leveraging technology to maximize customer success. Outside of work he is passionate about cooking, spending time with friends, and backpacking.

Song Feng is a Senior Applied Scientist at AWS AI Labs, specializing in Natural Language Processing and Artificial Intelligence. Her research explores various aspects of these fields including document-grounded dialogue modeling, reasoning for task-oriented dialogues, and interactive text generation using multimodal data.

John Baker is a Principal SDE at AWS where he works on Natural Language Processing, Large Language Models and other ML/AI related projects. He has been with Amazon for 9+ years and has worked across AWS, Alexa and Amazon.com. In his spare time, John enjoys skiing and other outdoor activities throughout the Pacific Northwest.

SimPer: Simple self-supervised learning of periodic targets

Learning from periodic data (signals that repeat, such as a heart beat or the daily temperature changes on Earth’s surface) is crucial for many real-world applications, from monitoring weather systems to detecting vital signs. For example, in the environmental remote sensing domain, periodic learning is often needed to enable nowcasting of environmental changes, such as precipitation patterns or land surface temperature. In the health domain, learning from video measurement has been shown to extract (quasi-)periodic vital signs such as atrial fibrillation and sleep apnea episodes.

Approaches like RepNet highlight the importance of these types of tasks, and present a solution that recognizes repetitive activities within a single video. However, these are supervised approaches that require a significant amount of data to capture repetitive activities, all labeled to indicate the number of times an action was repeated. Labeling such data is often challenging and resource-intensive, requiring researchers to manually capture gold-standard temporal measurements that are synchronized with the modality of interest (e.g., video or satellite imagery).

Alternatively, self-supervised learning (SSL) methods (e.g., SimCLR and MoCo v2), which leverage a large amount of unlabeled data to learn representations that capture periodic or quasi-periodic temporal dynamics, have demonstrated success in solving classification tasks. However, they overlook the intrinsic periodicity (i.e., the ability to identify if a frame is part of a periodic process) in data and fail to learn robust representations that capture periodic or frequency attributes. This is because periodic learning exhibits characteristics that are distinct from prevailing learning tasks.

Feature similarity is different in the context of periodic representations as compared to static features (e.g., images). For example, videos that are offset by short time delays or are reversed should be similar to the original sample, whereas videos that have been upsampled or downsampled by a factor x should be different from the original sample by a factor of x.

To address these challenges, in “SimPer: Simple Self-Supervised Learning of Periodic Targets”, published at the eleventh International Conference on Learning Representations (ICLR 2023), we introduced a self-supervised contrastive framework for learning periodic information in data. Specifically, SimPer leverages the temporal properties of periodic targets using temporal self-contrastive learning, where positive and negative samples are obtained through periodicity-invariant and periodicity-variant augmentations from the same input instance. We propose periodic feature similarity that explicitly defines how to measure similarity in the context of periodic learning. Moreover, we design a generalized contrastive loss that extends the classic InfoNCE loss to a soft regression variant that enables contrasting over continuous labels (frequency). Next, we demonstrate that SimPer effectively learns periodic feature representations compared to state-of-the-art SSL methods, highlighting its intriguing properties including better data efficiency, robustness to spurious correlations, and generalization to distribution shifts. Finally, we are excited to release the SimPer code repo to the research community.

The SimPer framework

SimPer introduces a temporal self-contrastive learning framework. Positive and negative samples are obtained through periodicity-invariant and periodicity-variant augmentations from the same input instance. For temporal video examples, periodicity-invariant changes are cropping, rotation or flipping, whereas periodicity-variant changes involve increasing or decreasing the speed of a video.

To explicitly define how to measure similarity in the context of periodic learning, SimPer proposes periodic feature similarity. This construction allows us to formulate training as a contrastive learning task. A model can be trained with data without any labels and then fine-tuned if necessary to map the learned features to specific frequency values.

Given an input sequence x, we know there’s an underlying associated periodic signal. We then transform x to create a series of speed or frequency altered samples, which changes the underlying periodic target, thus creating different negative views. Although the original frequency is unknown, we effectively devise pseudo speed or frequency labels for the unlabeled input x.
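As a rough illustration of the two kinds of views, the following sketch builds them for a toy 1-D sine wave rather than for video. The helper names (change_speed, invariant_view, dominant_frequency) and the toy signal are assumptions made for this example; they are not taken from the SimPer code release.

```python
import numpy as np

def change_speed(x, speed, length):
    """Periodicity-variant view: resample x so it plays `speed` times faster."""
    # Positions in the original sequence visited by the sped-up playback.
    positions = np.arange(length) * speed
    if positions[-1] > len(x) - 1:
        raise ValueError("input too short for this speed and output length")
    return np.interp(positions, np.arange(len(x)), x)

def invariant_view(x, shift, reverse=False):
    """Periodicity-invariant view: a circular temporal shift, optionally reversed."""
    view = np.roll(x, shift)
    return view[::-1] if reverse else view

def dominant_frequency(x):
    """Strongest non-DC bin of the power spectrum, in cycles per window."""
    power = np.abs(np.fft.rfft(x - x.mean())) ** 2
    return int(np.argmax(power[1:]) + 1)

# Toy stand-in for an unlabeled periodic input: 8 cycles in 512 samples.
t = np.arange(512)
x = np.sin(2 * np.pi * 8 * t / 512)

# Negative views: each speed acts as a pseudo label for the resampled clip.
speeds = [0.5, 1.0, 1.5, 2.0]
negatives = [change_speed(x, s, length=256) for s in speeds]
negative_frequencies = [dominant_frequency(v) for v in negatives]

# Positive view: shifting and reversing leaves the frequency untouched.
positive = invariant_view(x, shift=37, reverse=True)

print("pseudo speed labels:", speeds)
print("dominant frequency of each negative view:", negative_frequencies)
print("original vs. positive view:", dominant_frequency(x), dominant_frequency(positive))
```

In this toy setting the dominant frequency of each negative view scales with its pseudo speed label, while the shifted and reversed positive view keeps the frequency of the original, which is the property the two augmentation families are meant to have.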

Conventional similarity measures such as cosine similarity emphasize strict proximity between two feature vectors, and are sensitive to index shifted features (which represent different time stamps), reversed features, and features with changed frequencies. In contrast, periodic feature similarity should be high for samples with small temporal shifts and/or reversed indexes, while capturing a continuous similarity change when the feature frequency varies. This can be achieved via a similarity metric in the frequency domain, such as the distance between two Fourier transforms.
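To make the contrast concrete, here is a minimal sketch comparing cosine similarity with one possible frequency-domain distance on toy sine waves. The specific distance (an L1 distance between cumulative normalized power spectra) is an assumption chosen for illustration because it grows gradually with the frequency gap; it is not claimed to be the measure used in the paper, and the function names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def normalized_power_spectrum(x):
    """Power spectrum of x (mean removed), scaled to sum to 1."""
    power = np.abs(np.fft.rfft(x - x.mean())) ** 2
    return power / power.sum()

def spectral_distance(a, b):
    """Illustrative frequency-domain distance; lower means more similar."""
    pa = normalized_power_spectrum(a)
    pb = normalized_power_spectrum(b)
    # Comparing cumulative spectra makes the distance grow with the frequency gap.
    return float(np.abs(np.cumsum(pa) - np.cumsum(pb)).sum())

def sine(cycles, n=256):
    return np.sin(2 * np.pi * cycles * np.arange(n) / n)

x = sine(8)
shifted = np.roll(x, 16)   # half a period later
reversed_x = x[::-1]

print("cosine, shifted:", cosine_similarity(x, shifted))
print("spectral distance, shifted:", spectral_distance(x, shifted))
print("spectral distance, reversed:", spectral_distance(x, reversed_x))
for cycles in (9, 10, 12):
    print("spectral distance to", cycles, "cycles:", spectral_distance(x, sine(cycles)))
```

Cosine similarity treats the half-period shift as a near-opposite vector, whereas the spectral distance ignores the shift and the reversal and increases step by step as the frequency moves away from that of the original.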

To harness the intrinsic continuity of augmented samples in the frequency domain, SimPer designs a generalized contrastive loss that extends the classic InfoNCE loss to a soft regression variant that enables contrasting over continuous labels (frequency). This makes it suitable for regression tasks, where the goal is to recover a continuous signal, such as a heart beat.
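The idea of contrasting over continuous labels can be sketched as a cross-entropy against soft targets, where the targets come from how close two pseudo labels are. This is a simplified sketch and not the exact loss from the paper: the label similarity (negative absolute difference of speeds), the temperature value, and the function names are all assumptions made for this example.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax, computed stably."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def soft_infonce(feature_sim, labels, temperature=0.1):
    """Soft-target contrastive loss.

    feature_sim[i, j]: similarity between the features of view i and view j.
    labels[i]: continuous pseudo label (for example, speed) of view i.
    """
    labels = np.asarray(labels, dtype=float)
    # Assumed label similarity: closer speeds receive larger target weights.
    label_sim = -np.abs(labels[:, None] - labels[None, :])
    targets = np.exp(log_softmax(label_sim))  # each row sums to 1
    log_probs = log_softmax(np.asarray(feature_sim, dtype=float) / temperature)
    return float(-(targets * log_probs).sum(axis=1).mean())

def hard_infonce(feature_sim, temperature=0.1):
    """Classic InfoNCE for comparison: the only positive for row i is column i."""
    log_probs = log_softmax(np.asarray(feature_sim, dtype=float) / temperature)
    return float(-np.diag(log_probs).mean())

speeds = np.array([0.5, 1.0, 1.5, 2.0])
# Feature similarities that follow the label structure, and a scrambled version.
aligned = -np.abs(speeds[:, None] - speeds[None, :])
scrambled = aligned[:, ::-1]

loss_aligned = soft_infonce(aligned, speeds, temperature=1.0)
loss_scrambled = soft_infonce(scrambled, speeds, temperature=1.0)
print("aligned features:", loss_aligned)
print("scrambled features:", loss_scrambled)
```

In this sketch the loss is lowest when feature similarities follow the ordering of the pseudo labels, and it approaches the classic one-positive InfoNCE loss when the labels are so far apart that each row's target collapses onto its own column.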

SimPer constructs negative views of data through transformations in the frequency domain. The input sequence x has an underlying associated periodic signal. SimPer transforms x to create a series of speed or frequency altered samples, which changes the underlying periodic target, thus creating different negative views. Although the original frequency is unknown, we effectively devise pseudo speed or frequency labels for unlabeled input x (periodicity-variant augmentations τ). SimPer takes transformations that do not change the identity of the input and defines these as periodicity-invariant augmentations σ, thus creating different positive views of the sample. Then, it sends these augmented views to the encoder f, which extracts corresponding features.

Results

To evaluate SimPer’s performance, we benchmarked it against state-of-the-art SSL schemes (e.g., SimCLR, MoCo v2, BYOL, CVRL) on a set of six diverse periodic learning datasets for common real-world tasks in human behavior analysis, environmental remote sensing, and healthcare. Specifically, below we present results on heart rate measurement and exercise repetition counting from video. The results show that SimPer outperforms the state-of-the-art SSL schemes across all six datasets, highlighting its superior performance in terms of data efficiency, robustness to spurious correlations, and generalization to unseen targets.

Here we show quantitative results on two representative datasets using models pre-trained with various SSL methods, including SimPer, and fine-tuned on the labeled data. First, we pre-train SimPer using the Univ. Bourgogne Franche-Comté Remote PhotoPlethysmoGraphy (UBFC) dataset, a human photoplethysmography and heart rate prediction dataset, and compare its performance to state-of-the-art SSL methods. We observe that SimPer outperforms SimCLR, MoCo v2, BYOL, and CVRL methods. The results on the human action counting dataset, Countix, further confirm the benefits of SimPer over other methods as it notably outperforms the supervised baseline. For the feature evaluation results and performance on other datasets, please refer to the paper.

Results of SimCLR, MoCo v2, BYOL, CVRL and SimPer on the Univ. Bourgogne Franche-Comté Remote PhotoPlethysmoGraphy (UBFC) and Countix datasets. Heart rate and repetition count performance is reported as mean absolute error (MAE).

Conclusion and applications

We present SimPer, a self-supervised contrastive framework for learning periodic information in data. We demonstrate that by combining a temporal self-contrastive learning framework, periodicity-invariant and periodicity-variant augmentations, and continuous periodic feature similarity, SimPer provides an intuitive and flexible approach for learning strong feature representations for periodic signals. Moreover, SimPer can be applied to various fields, ranging from environmental remote sensing to healthcare.

Acknowledgements

We would like to thank Yuzhe Yang, Xin Liu, Ming-Zher Poh, Jiang Wu, Silviu Borac, and Dina Katabi for their contributions to this work.
