Build RAG-based generative AI applications in AWS using Amazon FSx for NetApp ONTAP with Amazon Bedrock

Build RAG-based generative AI applications in AWS using Amazon FSx for NetApp ONTAP with Amazon Bedrock

The post is co-written with Michael Shaul and Sasha Korman from NetApp.

Generative artificial intelligence (AI) applications are commonly built using a technique called Retrieval Augmented Generation (RAG) that provides foundation models (FMs) access to additional data they didn’t have during training. This data is used to enrich the generative AI prompt to deliver more context-specific and accurate responses without continuously retraining the FM, while also improving transparency and minimizing hallucinations.

In this post, we demonstrate a solution using Amazon FSx for NetApp ONTAP with Amazon Bedrock to provide a RAG experience for your generative AI applications on AWS by bringing company-specific, unstructured user file data to Amazon Bedrock in a straightforward, fast, and secure way.

Our solution uses an FSx for ONTAP file system as the source of unstructured data and continuously populates an Amazon OpenSearch Serverless vector database with the user’s existing files and folders and associated metadata. This enables a RAG scenario with Amazon Bedrock by enriching the generative AI prompt using Amazon Bedrock APIs with your company-specific data retrieved from the OpenSearch Serverless vector database.

When developing generative AI applications such as a Q&A chatbot using RAG, customers are also concerned about keeping their data secure and preventing end-users from querying information from unauthorized data sources. Our solution also uses FSx for ONTAP to allow users to extend their current data security and access mechanisms to augment model responses from Amazon Bedrock. We use FSx for ONTAP as the source of associated metadata, specifically the user’s security access control list (ACL) configurations attached to their files and folders and populate that metadata into OpenSearch Serverless. By combining access control operations with file events that notify the RAG application of new and changed data on the file system, our solution demonstrates how FSx for ONTAP enables Amazon Bedrock to only use embeddings from authorized files for the specific users that connect to our generative AI application.

AWS serverless services make it straightforward to focus on building generative AI applications by providing automatic scaling, built-in high availability, and a pay-for-use billing model. Event-driven compute with AWS Lambda is a good fit for compute-intensive, on-demand tasks such as document embedding and flexible large language model (LLM) orchestration, and Amazon API Gateway provides an API interface that allows for pluggable frontends and event-driven invocation of the LLMs. Our solution also demonstrates how to build a scalable, automated, API-driven serverless application layer on top of Amazon Bedrock and FSx for ONTAP using API Gateway and Lambda.

Solution overview

The solution provisions an FSx for ONTAP Multi-AZ file system with a storage virtual machine (SVM) joined to an AWS Managed Microsoft AD domain. An OpenSearch Serverless vector search collection provides a scalable and high-performance similarity search capability. We use an Amazon Elastic Compute Cloud (Amazon EC2) Windows server as an SMB/CIFS client to the FSx for ONTAP volume and configure data sharing and ACLs for the SMB shares in the volume. We use this data and ACLs to test permissions-based access to the embeddings in a RAG scenario with Amazon Bedrock.

The embeddings container component of our solution is deployed on an EC2 Linux server and mounted as an NFS client on the FSx for ONTAP volume. It periodically migrates existing files and folders along with their security ACL configurations to OpenSearch Serverless. It populates an index in the OpenSearch Serverless vector search collection with company-specific data (and associated metadata and ACLs) from the NFS share on the FSx for ONTAP file system.

The solution implements a RAG Retrieval Lambda function that allows RAG with Amazon Bedrock by enriching the generative AI prompt using Amazon Bedrock APIs with your company-specific data and associated metadata (including ACLs) retrieved from the OpenSearch Serverless index that was populated by the embeddings container component. The RAG Retrieval Lambda function stores conversation history for the user interaction in an Amazon DynamoDB table.

End-users interact with the solution by submitting a natural language prompt either through a chatbot application or directly through the API Gateway interface. The chatbot application container is built using Streamlit and fronted by an AWS Application Load Balancer (ALB). When a user submits a natural language prompt to the chatbot UI using the ALB, the chatbot container interacts with the API Gateway interface that then invokes the RAG Retrieval Lambda function to fetch the response for the user. The user can also directly submit prompt requests to API Gateway and obtain a response. We demonstrate permissions-based access to the RAG documents by explicitly retrieving the SID of a user and then using that SID in the chatbot or API Gateway request, where the RAG Retrieval Lambda function then matches the SID to the Windows ACLs configured for the document. As an additional authentication step in a production environment, you may want to also authenticate the user against an identity provider and then match the user against the permissions configured for the documents.

The following diagram illustrates the end-to-end flow for our solution. We start by configuring data sharing and ACLs with FSx for ONTAP, and then these are periodically scanned by the embeddings container. The embeddings container splits the documents into chunks and uses the Amazon Titan Embeddings model to create vector embeddings from these chunks. It then stores these vector embeddings with associated metadata in our vector database by populating an index in a vector collection in OpenSearch Serverless. The following diagram illustrates the end-to-end flow.

end to end embedding flow for the fsxontap and bedrock integration

The following architecture diagram illustrates the various components of our solution.overall architecture diagram describing all the components of the solution

Prerequisites

Complete the following prerequisite steps:

  1. Make sure you have model access in Amazon Bedrock. In this solution, we use Anthropic Claude v3 Sonnet on Amazon Bedrock.
  2. Install the AWS Command Line Interface (AWS CLI).
  3. Install Docker.
  4. Install Terraform.

Deploy the solution

The solution is available for download on this GitHub repo. Cloning the repository and using the Terraform template will provision all the components with their required configurations.

  1. Clone the repository for this solution:
    sudo yum install -y unzip
    git clone https://github.com/aws-samples/genai-bedrock-fsxontap.git
    cd genai-bedrock-fsxontap/terraform

  2. From the terraform folder, deploy the entire solution using Terraform:
    terraform init
    terraform apply -auto-approve

This process can take 15–20 minutes to complete. When finished, the output of the terraform commands should look like the following:

api-invoke-url = "https://9ng1jjn8qi.execute-api.<region>.amazonaws.com/prod"
fsx-management-ip = toset([
"198.19.255.230",])
fsx-secret-id = "arn:aws:secretsmanager:<region>:<account-id>:secret:AmazonBedrock-FSx-NetAPP-ONTAP-a2fZEdIt-0fBcS9"
fsx-svm-smb-dns-name = "BRSVM.BEDROCK-01.COM"
lb-dns-name = "chat-load-balancer-2040177936.<region>.elb.amazonaws.com"

Load data and set permissions

To test the solution, we will use the EC2 Windows server (ad_host) mounted as an SMB/CIFS client to the FSx for ONTAP volume to share sample data and set user permissions that will then be used to populate the OpenSearch Serverless index by the solution’s embedding container component. Perform the following steps to mount your FSx for ONTAP SVM data volume as a network drive, upload data to this shared network drive, and set permissions based on Windows ACLs:

  1. Obtain the ad_host instance DNS from the output of your Terraform template.
  2. Navigate to AWS Systems Manager Fleet Manager on your AWS console, locate the ad_host instance and follow instructions here to login with Remote Desktop. Use the domain admin user bedrock-01Admin and obtain the password from AWS Secrets Manager. You can find the password using the Secrets Manager fsx-secret-id secret id from the output of your Terraform template.
  3. To mount an FSx for ONTAP data volume as a network drive, under This PC, choose (right-click) Network and then choose Map Network drive.
  4. Choose the drive letter and use the FSx for ONTAP share path for the mount
    (\<svm>.<domain >c$<volume-name>):
    map network drive
  5. Upload the Amazon Bedrock User Guide to the shared network drive and set permissions to the admin user only (make sure that you disable inheritance under Advanced):upload the amazon bedrock user guide
  6. Upload the Amazon FSx for ONTAP User Guide to the shared drive and make sure permissions are set to Everyone:upload the amazon fsx ontap media guide
  7. On the ad_host server, open the command prompt and enter the following command to obtain the SID for the admin user:
    wmic useraccount where name='Admin' get sid

Test permissions using the chatbot

To test permissions using the chatbot, obtain the lb-dns-name URL from the output of your Terraform template and access it through your web browser:

test with chatbot and enter prompt

For the prompt query, ask any general question on the FSx for ONTAP user guide that is available for access to everyone. In our scenario, we asked “How can I create an FSx for ONTAP file system,” and the model replied back with detailed steps and source attribution in the chat window to create an FSx for ONTAP file system using the AWS Management Console, AWS CLI, or FSx API:

test with chatbot and enter prompt related to the bedrock guide

Now, let’s ask a question about the Amazon Bedrock user guide that is available for admin access only. In our scenario, we asked “How do I use foundation models with Amazon Bedrock,” and the model replied with the response that it doesn’t have enough information to provide a detailed answer to the question.:

Use the admin SID on the user (SID) filter search in the chat UI and ask the same question in the prompt. This time, the model should reply with steps detailing how to use FMs with Amazon Bedrock and provide the source attribution used by the model for the response:

Test permissions using API Gateway

You can also query the model directly using API Gateway. Obtain the api-invoke-url parameter from the output of your Terraform template.

curl -v '<api-invoke-url>/bedrock_rag_retreival' -X POST -H 'content-type: application/json' -d '{"session_id": "1","prompt": "What is an FSxN ONTAP filesystem?", "bedrock_model_id": "anthropic.claude-3-sonnet-20240229-v1:0", "model_kwargs": {"temperature": 1.0, "top_p": 1.0, "top_k": 500}, "metadata": "NA", "memory_window": 10}'

Then invoke the API gateway with Everyone access for a query related to the FSx for ONTAP user guide by setting the value of the metadata parameter to NA to indicate Everyone access:

curl -v '<api-invoke-url>/bedrock_rag_retreival' -X POST -H 'content-type: application/json' -d '{"session_id": "1","prompt": "what is bedrock?", "bedrock_model_id": "anthropic.claude-3-sonnet-20240229-v1:0", "model_kwargs": {"temperature": 1.0, "top_p": 1.0, "top_k": 500}, "metadata": "S-1-5-21-4037439088-1296877785-2872080499-1112", "memory_window": 10}'

Cleanup

To avoid recurring charges, clean up your account after trying the solution. From the terraform folder, delete the Terraform template for the solution:

terraform apply --destroy

Conclusion

In this post, we demonstrated a solution that uses FSx for ONTAP with Amazon Bedrock and uses FSx for ONTAP support for file ownership and ACLs to provide permissions-based access in a RAG scenario for generative AI applications. Our solution enables you to build generative AI applications with Amazon Bedrock where you can enrich the generative AI prompt in Amazon Bedrock with your company-specific, unstructured user file data from an FSx for ONTAP file system. This solution enables you to deliver more relevant, context-specific, and accurate responses while also making sure only authorized users have access to that data. Finally, the solution demonstrates the use of AWS serverless services with FSx for ONTAP and Amazon Bedrock that enable automatic scaling, event-driven compute, and API interfaces for your generative AI applications on AWS.

For more information about how to get started building with Amazon Bedrock and FSx for ONTAP, refer to the following resources:


About the authors

Kanishk Mahajan is Principal, Solutions Architecture at AWS. He leads cloud transformation and solution architecture for ISV customers and partner at AWS. Kanishk specializes in containers, cloud operations, migrations and modernizations, AI/ML, resilience and security and compliance. He is a Technical Field Community (TFC) member in each of those domains at AWS.

Michael Shaul is a Principal Architect at NetApp’s office of the CTO. He has over 20 years of experience building data management systems, applications, and infrastructure solutions. He has a unique in-depth perspective on cloud technologies, builder, and AI solutions.

Sasha Korman is a tech visionary leader of dynamic development and QA teams across Israel and India. With 14-years at NetApp that began as a programmer, his hands-on experience and leadership have been pivotal in steering complex projects to success, with a focus on innovation, scalability, and reliability.

Read More

Support for AWS DeepComposer ending soon

Support for AWS DeepComposer ending soon

AWS DeepComposer was first introduced during AWS re:Invent 2019 as a fun way for developers to compose music by using generative AI. AWS DeepComposer was the world’s first machine learning (ML)-enabled keyboard for developers to get hands-on—literally—with a musical keyboard and the latest ML techniques to compose their own music.

After careful consideration, we have made the decision to end support for AWS DeepComposer, effective September 17, 2025. With your help and feedback, our portfolio of products and services has grown to include new tools for developers to get hands-on with AI and ML. Amazon PartyRock, for example, is a generative AI playground for intuitive, code-free help in building web applications.

If you have data stored on the AWS DeepComposer console, you will be able to use AWS DeepComposer as normal until September 17, 2025, when support for the service will end. After this date, you will no longer be able to use AWS DeepComposer through the AWS Management Console, manage AWS DeepComposer devices, or access any compositions or models you have created. Until then, you can continue to work on your compositions or models and export those you would like to keep by using the step-by-step guide in the AWS DeepComposer FAQs.

If you have additional questions, please read our FAQs or contact us.


About the author

Kanchan Jagannathan is a Sr. Program Manager in the AWS AI Devices team where he helps launches AWS devices into sales channel and also oversees the Service Availability Change process for the team. He was a Program Manager for FC automation deployment and launches before joining AWS. Outside of work, he has begun to bravely endeavour camping with his 5-yr old and 1-yr old kids and enjoying the moments he gets to be with them.

Read More

Preserve access and explore alternatives for Amazon Lookout for Equipment

Preserve access and explore alternatives for Amazon Lookout for Equipment

Amazon Lookout for Equipment, the AWS machine learning (ML) service designed for industrial equipment predictive maintenance, will no longer be open to new customers effective October 17, 2024. Existing customers will be able to use the service (both using the AWS Management Console and API) as normal and AWS will continue to invest in security, availability, and performance improvements for Lookout for Equipment, but we do not plan to introduce new features for this service.

This post discusses how you can maintain access to Lookout for Equipment after it is closed to new customers and some alternatives to Lookout for Equipment.

Maintaining access to Lookout for Equipment

You’re considered an existing customer if you use the service, either through cloud training or cloud inferencing, any time in the 30 days prior to October 17, 2024 (September 17, 2024, through October 16, 2024). To maintain access to the service after October 17, 2024, you should complete one of the following tasks from the account for which you intend to maintain access:

  • On the Lookout for Equipment console, start a new project and successfully complete a model training
  • On the Lookout for Equipment console, open an existing project, schedule an inference for a given model, and run at least one inference
  • Use Lookout for Equipment API calls CreateInferenceScheduler and StartInferenceScheduler (and StopInferenceScheduler when done)

For any questions or support needed, contact your assigned AWS Account Manager or Solutions Architect, or create a case from the AWS console.

Alternatives to Lookout for Equipment

If you’re interested in an alternative to Lookout for Equipment, AWS has options for both buyers and builders.

For an out-of-the-box solution, the AWS Partner Network offers solutions from multiple partners. You can browse solutions on the Asset Maintenance and Reliability page in the AWS Solutions Library. This approach provides a solution that addresses your use case without requiring you to have expertise in predictive maintenance, and typically provides the fastest time to value by using the specialized expertise of the AWS Partners.

If you prefer to build your own solution, AWS offers AI/ML tools and services to help you develop an AI-based predictive maintenance solution. Amazon SageMaker provides a set of tools to enable you to build, train, infer, and deploy ML models for your use case with fully managed infrastructure, tools, and workflows.

Summary

Although new customers will no longer have access to Lookout for Equipment after October 17, 2024, AWS offers a powerful set of AI/ML services and solutions in the form of SageMaker tools to build customer models, and also offers a range of solutions from partners through the AWS Partner Network. You should explore these options to determine what works best for your specific needs.

For more details, refer to the following resources:


About the author

Stuart Gillen is a Sr. Product Manager, Lookout for Equipment, at AWS. Stuart has held a variety of roles in engineering management, business development, product management, and consulting. Most of his career has been focused on industrial applications specifically in reliability practices, maintenance systems, and manufacturing.
Stuart is the Product Manager for Lookout for Equipment at AWS where he utilizes his industrial and AI background in applications focusing on Predictive Maintenance and Condition Monitoring.

Read More

Eureka: Evaluating and understanding progress in AI

Eureka: Evaluating and understanding progress in AI

A summary of insights extracted by using the Eureka framework, shown via two radar charts for multimodal (left) and language (right) capabilities respectively. The radar charts show the best and worst performance observed for each capability.

In the fast-paced progress of AI, the question of how to evaluate and understand capabilities of state-of-the-art models is timelier than ever. New and capable models are being released frequently, and each release promises the next big leap in frontiers of intelligence. Yet, as researchers and developers, often we ask ourselves: Are these models all comparable, if not the same, in terms of capabilities? There are, of course, strong reasons to believe they are, given that many score similarly in standard benchmarks. In addition, rankings in the numerous leaderboards do not offer a consistent and detailed explanation of why a model is ranked slightly better than others. However, if some models are fundamentally different, what are their strengths and weaknesses? More importantly, are there capabilities that are essential for making AI useful in the real world but still universally challenging for most models? Answering such questions helps us understand where we are on the frontier of AI, and what capability improvements are needed to meet the expectations that humanity and science have for safe and responsible deployments of AI models. 

The prevalence of these models is dependent on our ability to mature the science of in-depth AI evaluation and measurement. In our latest open-source release and technical report EUREKA: Evaluating and Understanding Large Foundation Models (opens in new tab), we start answering these questions by running an in-depth measurement analysis across 12 state-of-the-art proprietary and open-weights models. Behind this analysis stands Eureka (opens in new tab), an open-source framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings. The framework currently supports both language and multimodal (text and image) data and enables developers to define custom pipelines for data processing, inference, and evaluation, with the possibility to inherit from existing pipelines and minimize development work. Eureka and all our evaluation pipelines are available as open source to foster transparent and reproducible evaluation practices. We hope to collaborate with the open-source community to share and expand current measurements for new capabilities and models. 

Focus on challenging and non-saturated capabilities

Eureka tests models across a rich collection of fundamental language and multimodal capabilities that are challenging for even the most advanced models, but are often overlooked by standard benchmarks commonly reported in model releases. In practice, this also means that our analysis intentionally does not pivot on oversaturated benchmarks. As unconventional as this may sound, it is motivated by two reasons. First, measurement on saturated benchmarks, for which most models perform over 95%, leaves very little space for failure analysis and model comparison. Second, even though saturation may be rooted in genuine model improvements, concerns about memorization and overfitting to labeling errors lower the credibility of measurements, especially in the very high accuracy regime. 

Microsoft Research blog

Microsoft at FAccT 2024: Advancing responsible AI research and practice

From studying how to identify gender bias in Hindi to uncovering AI-related risks for workers, Microsoft is making key contributions towards advancing the state of the art in responsible AI research. Check out their work at ACM FAccT 2024.


Beyond single-score measurements and universal rankings

Even though rankings and leaderboards remain the quickest way to compare models, they rarely uncover important conditions of failure. Due to overreliance on single-score aggregations of performance, the more nuanced comparative findings are hidden behind small differences between model scores aggregated across many capabilities and experimental conditions.

As we show in our study, the chase after these rankings has created surprising dynamics that do not necessarily lead to identical models, but to models that use different complementary skills to achieve comparable overall scores in important leaderboards. Imagine you are a triathlon athlete aiming to achieve an elite performance, which historically takes around two hours. Despite your ambition to hit this top-tier mark, you face constraints with limited time and resources for training and preparation. In practice, athletes often focus their best resources on excelling in certain disciplines while aiming for a satisfactory performance in others. They prioritize based on what they believe is most achievable given their time and experience.

We observe similar phenomena in the set of 12 models we study. Even if two models may score very closely for the same capability, disaggregating that performance across disciplines and input conditions shows that each model has its own complementary strengths. Identifying, measuring, and understanding these strengths for a single model is needed for planning targeted improvements. Repeating this process for a large set of models, as we do in Eureka, is needed for identifying the hypothetical frontier, guiding research and development, and creating a model that combines and delivers capabilities that build on the strengths observed in existing models. 

Measuring consistency: non-determinism and backward compatibility

When people work with collaborators or when they choose tools to assist them in everyday tasks, predictability and consistency are key to a successful collaboration. Similarly, humans and application developers expect their AI assistants and models to be consistent over time for similar inputs and interactions. In our analysis, we study this under-explored angle of model performance, by focusing on two key aspects: the determinism of answer outcomes for identical examples and prompts, and the backward compatibility of model answers at the example level after a model has been updated with a new version. Lack of consistency in either of these domains would lead to breaking trust with users and application developers. 

The analysis shows surprising results and opens new considerations for improvement. For example, we observe that very few large foundation models are fully deterministic and for most of them there are visible variations in the output — and most importantly in accuracy — when asked the same question several times, with generation temperature set to zero—a control that tells models to minimize randomness in generations. In addition, when comparing new model releases with earlier models from the same family, a significant amount of regress at the example level can be observed after the update, even though the overall accuracy may increase. In practice, this type of inconsistency can be frustrating for application developers who rely on prewritten examples and prompts propagated to a foundation model. 

Eureka Insights

Figure 1 is a high-level illustration of the current state of AI for Eureka-Bench, highlighting the best and the worst performances across various capabilities. These results reveal a nuanced picture of different models’ strengths, showing that no single model excels in all tasks. However, Claude 3.5 Sonnet, GPT-4o 2024-05-13, and Llama 3.1 405B consistently outperform others in several key areas.

A summary of insights extracted by using the Eureka framework, shown via two radar charts for multimodal (left) and language (right) capabilities respectively. The radar charts show the best and worst performance observed for each capability.
Figure 1 – Performance of best and worse models for multimodal (left) and language (right) datasets in in Eureka-Bench. The red frontier shows the performance of the worse model, indicating the area that is already solved for the set of capabilities. The green frontier shows the performance of the best model, indicating the best-known result with current technology. The blue horizon between the best model and the maximum performance shows the room for improvement for mastering the capability. The best performance sets indicated in the green border include all models that perform within 2% of the best observed result. 

Multimodal capabilities

Evaluation in Eureka reveals that state-of-the-art models are still fairly limited in their multimodal abilities, specifically when it comes to detailed image understanding (for example, localization of objects, geometric and spatial reasoning, and navigation), which is most needed in truly multimodal scenarios that require physical awareness, visual grounding, and localization. 

  1. State-of-the-art multimodal models struggle with geometric reasoning. 
    Models perform worse in reasoning about height than about depth. Claude 3.5 Sonnet and Gemini 1.5 Pro are the best performing models for this task, with Claude 3.5 Sonnet being the most accurate model for depth ordering, and Gemini 1.5 Pro the most accurate for height ordering. 
  2. Multimodal capabilities lag language capabilities. 
    On tasks that can be described either as multimodal or as language-only, the performance of most tested models is higher for the language-only condition. GPT-4o 2024-05-13 is the only model that consistently achieves better results when presented with both vision and language information, showing therefore that it can better fuse the two data modalities.
  3. Complementary performance across models for fundamental multimodal skills.
    Claude 3.5 Sonnet, GPT-4o 2024-05-13, and GPT-4 Turbo 2024-04-09 have comparable performance in multimodal question answering (MMMU). In tasks like object recognition and visual prompting, the performance of Claude 3.5 Sonnet is better or comparable to GPT-4o 2024-05-13, but Gemini 1.5 Pro outperforms them both. Finally, in tasks like object detection and spatial reasoning, GPT-4o 2024-05-13 is the most accurate model. 

Language

The evaluation through Eureka shows that there have been important advances from state-of-the-art models in the language capabilities of instruction following, long context question answering, information retrieval, and safety. The analysis also discovers major differences and gaps between models related to robustness to context length, factuality and grounding for information retrieval, and refusal behavior. 

  1. Faster improvements in instruction following across all model families. 
    Instruction following is the ability to follow guidance expressed in user prompts regarding specifications related to format, style, and structure of the generated content. Among the studied language capabilities, instruction following is where most models are improving faster, potentially due to strong investments in instruction tuning processes, with most models now having an instruction following rate of higher than 75%. 
  2. All models’ performance in question answering drops with longer context. 
    Contrary to “needle-in-a-haystack” experiments, testing state-of-the-art models on tasks that involve reasoning over long context shows significant decline in performance as context size grows. Amongst all models, GPT-4o 2024-05-13 and Llama 3.1 405B have the lowest drop in performance for longer context.
  3. Major gaps in factuality and grounding for information retrieval from parametric knowledge or input context. 
    Models exhibit query fact precision rates of lower than 55%, fact recall rates of lower than 25%, and rates of irrelevant and fabricated information above 20%. Llama 3.1 405B, GPT-4o 2024-05-13, and Claude 3.5 Sonnet are the top performers in this area across different conditions.
  4. High refusal rates. Lower accuracy in detecting toxic content vs. neutral content for most models. 
    While several models have high accuracy rates for toxicity detection, others (Gemini 1.5 Pro, Claude 3.5 Sonnet, Claude 3 Opus, and Llama 3.1 405B) exhibit low accuracy in classifying toxic content and a high refusal rate to classify toxic or neutral context, both of which make toxic content difficult to detect. During the safe language generation evaluation, models like GPT-4 1106 Preview and Mistral Large 2407 have the highest toxicity rates. GPT-4o 2024-05-13 is the only model that has both a high toxicity detection accuracy and a low toxicity score for safe language generation. 

Non-determinism

Several models have highly non-deterministic output for identical runs. Gemini 1.5 Pro, GPT-4 1106 Preview, GPT-4 Vision Preview, and GPT-4 Turbo 2024-04-09 show high non-determinism of outcomes. These results raise important questions regarding the stability of user and developer experiences when repeatedly inferencing with identical queries using the same prompt templates. Llama 3 70B, Llama 3.1 70B, and Mistral Large 2407 are almost perfectly deterministic. 

Backward compatibility

Backward incompatibility for shifts within the same model family is prevalent across all state-of-the-art models. This is reflected in high regression rates for individual examples and at a subcategory level. This type of regression can break trust with users and application developers during model updates. Regression varies per task and metric, but we observe several cases when it is higher than 10% across three model families (Claude, GPT, Llama), and sometimes they can dominate progress rates for whole subcategories of data. 

Conclusion

The complementary results extracted from this study highlight opportunities for improving current models across various areas, aiming to match the performance of the best model for each individual capability in this challenge set. However, several tasks in the challenge set remain difficult even for the most capable models. It is crucial to discuss and explore whether these gaps can be addressed with current technologies, architectures, and data synthesis protocols.

Finally, Eureka and the set of associated benchmarks are only the initial snapshot of an effort that aims at reliably measuring progress in AI. Our team is excited about further collaborations with the open-source community and research, with the goal of sharing and extending current measurements for new capabilities and models. 

The post Eureka: Evaluating and understanding progress in AI appeared first on Microsoft Research.

Read More

Upgrade Livestreams With Twitch Enhanced Broadcasting and the NVIDIA Encoder

Upgrade Livestreams With Twitch Enhanced Broadcasting and the NVIDIA Encoder

At TwitchCon — a global convention for the Twitch livestreaming platform—livestreamers and content creators this week can experience the latest technologies for accelerating creative workflows and improving video quality.

That includes the beta release of Twitch Enhanced Broadcasting support for HEVC when using the NVIDIA encoder.

Content creators can also use the NVIDIA Broadcast app, eighth-generation NVIDIA NVENC and RTX-powered optimizations in streaming and video editing apps to enhance their productions.

Plus, the September NVIDIA Studio Driver, designed to optimize creative apps, is now ready for download. Studio Drivers undergo extensive testing to ensure seamless compatibility while enhancing features, automating processes and accelerating workflows.

Twitch Enhanced Broadcasting With HEVC

The tradeoff between higher-resolution video quality and reliable streaming is a common issue livestreamers struggle with.

Higher-quality video provides more enjoyable viewing experiences but can cause streams to buffer for viewers with lower bandwidth or older devices. Streaming lower-bitrate video allows more people to watch content seamlessly but introduces artifacts that can interfere with viewing quality.

To address this issue, NVIDIA and Twitch collaborated to develop Twitch Enhanced Broadcasting. The feature adds the capability to send multiple streams — different versions of encoded video with different resolutions or bitrates — directly from NVIDIA GeForce RTX-equipped PCs or NVIDIA RTX workstations to deliver the highest-quality video a viewer’s internet connection can handle.

Twitch supports HEVC (H.265) in the Enhanced Broadcasting closed beta. With the NVIDIA encoder, Twitch streamers get 25% improved efficiency and quality over H.264.

This means that video will look as if it were being streamed with 25% more bitrate — in higher quality and with reduced artifacts or encoding errors. The feature is ideal for streaming fast-paced gameplay, enabling cleaner, sharper video with minimal lag.

Because all stream versions are generated with a dedicated hardware encoder on GeForce RTX GPUs, the rest of the system’s GPU and CPU are free to focus on running games more smoothly to maximize performance.

Learn how to get started on twitch.com.

AI-Enhanced Microphones and Webcams

Streaming is easier than ever with NVIDIA technologies.

For starters, PC performance and video quality are incredibly high quality thanks to NVIDIA’s dedicated encoder. And, NVIDIA GPUs include Tensor Cores that efficiently run AI.

Livestreamers can use AI to enhance their hardware peripherals and devices, which is especially helpful for those who haven’t had the time or resources to assemble extensive audio and video setups.

NVIDIA Broadcast transforms any home office or dorm room into a home studio — without the need to purchase specialized equipment. Its AI-powered features include Noise and Echo Removal for microphones, and Virtual Background, Auto Frame, Video Noise Removal and Eye Contact for cameras.

Livestreamers can download the Broadcast app or access its effects across popular creative apps, including Corsair iCUE, Elgato Camera Hub, OBS, Streamlabs, VTube Studio and Wave Link.

Spotlight the Highlights

GeForce RTX GPUs make it lightning-fast to edit and enhance video footage on the most popular video editing apps, from Adobe Premiere Pro to CapCut Pro.

Streamers can use AI-powered, RTX-accelerated features like Enhance Speech to remove noise and improve the quality of dialogue clips; Auto Reframe to automatically size social media videos; and Scene Edit Detection to break up long videos, like B-roll stringouts, into individual clips.

NVIDIA encoders help turbocharge the export process. For those looking for extreme performance, the GeForce RTX 4070 Ti GPU and up come equipped with dual encoders that can be used in parallel to halve export times on apps like CapCut, the most widely used video editing app on TikTok.

Clearer, Sharper Viewing Experiences With RTX Video

NVIDIA RTX Video — available exclusively for NVIDIA and GeForce RTX GPU owners — can turn any online and native video into pristine 4K high dynamic range (HDR) content with two technologies: Video Super Resolution and Video HDR.

RTX Video Super Resolution de-artifacts and upscales streamed video to remove errors that occur during encoding or transport, then runs an AI super-resolution effect. The result is cleaner, sharper video that’s ideal for streaming on platforms like YouTube and Twitch.

Many users have HDR displays, but there isn’t much HDR content online. RTX Video HDR addresses this by turning any standard dynamic range (SDR) video into HDR10 quality that delivers a wider range of brights and darks and makes visuals more vibrant and colorful. This feature is especially helpful when watching dark-lit scenes in video games.

RTX Video HDR requires an RTX GPU connected to an HDR10-compatible monitor or TV. For more information, see the RTX Video FAQ.

Check out TwitchCon — taking place in San Diego and online from Sept. 20-22 for the latest streaming updates. 

Read More

New AI Innovation Hub in Tunisia Drives Technological Advancement Across Africa

New AI Innovation Hub in Tunisia Drives Technological Advancement Across Africa

A new AI innovation hub for developers across Tunisia launched today in Novation City, a technology park that’s designed to cultivate a vibrant, innovation ecosystem in mechatronics — an industry encompassing IT, mechanics and electronics — and to foster synergy between education, research and industry in the North African country.

Built in collaboration with the NVIDIA Deep Learning Institute (DLI), the hub offers the training, technologies and business networks needed to help drive AI adoption across the continent.

The hub’s launch is part of NVIDIA’s efforts to train 100,000 developers across Africa through the DLI over the next three years — a goal that’s about a quarter complete today.

Located in Sousse — a coastal city in central Tunisia that’s surrounded by universities, startups and other organizations with a strong focus on STEM and AI — the hub includes complimentary access to NVIDIA DLI courses on topics such as generative AI, accelerated computing and data science.

Through its AI, industry 4.0 and smart transport centers of excellence, Novation City provides cutting-edge resources and access to NVIDIA DGX infrastructure for AI startups and researchers. Novation City also hosts organized activities to drive ecosystem growth, such as hackathons and specialized training sessions. It’s all brought together in this new environment conducive to AI learning, experimentation and deployment.

“Novation City has launched several key AI initiatives to strengthen the ecosystem, with NVIDIA’s support being instrumental in empowering AI startups and advancing AI skills,” said Anas Rochdi, chief innovation officer at Novation City. “This year, we deployed Tunisia’s first NVIDIA DGX system and launched major academic initiatives in collaboration with the NVIDIA Deep Learning Institute, aiming to train more than 1,000 developers in one year.”

Novation City also runs several startup accelerator programs, and more than 10 participating companies are members of NVIDIA Inception, a free program that nurtures cutting-edge startups.

Tunisia Drives Innovation in STEM and AI Education

Tunisia’s education system has traditionally emphasized STEM, especially mathematics and the sciences, according to Wei Xiao, who leads NVIDIA’s developer relations team for startups, enterprises and universities in the Middle East and Africa.

“Tunisia has a rich history of valuing knowledge and scholarship, dating back to ancient Carthage and the Islamic Golden Age,” Xiao said. “And the nation’s curriculum is rigorous, creating a solid foundation for advanced studies in STEM fields.”

This has made the country — well-situated to serve as a gateway between Europe, Africa and the Middle East (EMEA) — a thriving ecosystem for innovation, entrepreneurship and research. Novation City and NVIDIA aim to bolster the ecosystem even further through the new AI innovation hub.

Fostering collaboration between academia, industry and government, the initiative is funded by French and German development agencies, the Tunisian government, the World Bank and the European Union, as well as several enterprises.

And the hub comes at a time when Tunisia has adopted a national strategy for AI and digitalization — which includes promoting AI education and research — as part of a broader vision to position the nation as a digital leader in Africa. For example, the University of Tunis this month launched the nation’s first public institute specializing in AI.

Partners Foster Sovereign AI in Tunisia and Beyond

A key part of sovereign AI is a nation’s ability to produce artificial intelligence using its own workforce — along with its own infrastructure, data and business networks. Free DLI training offered through Tunisia’s AI innovation hub is poised to enable just that, helping upskill the next generation of African AI experts.

Plus, Novation City already offers a wide range of facilities designed to support technological and scientific advancement.

In February, Novation City deployed an NVIDIA DGX system, among the first in Africa, that has empowered about 30 startups across the continent in climate AI, transportation, manufacturing, agtech and other industries to develop accelerated computing-based solutions.

In addition, ESPRIT University — a specialized university in Tunisia with more than 10,000 engineering students — boasts nine NVIDIA DLI ambassadors who are delivering training to students and contributing to the broader tech ecosystem across the country. This makes ESPRIT University one of the most active DLI organizations across EMEA.

Since 2018, ESPRIT has been tapping into DLI to advance AI education. The university has also acquired an NVIDIA DGX system to support research and product development.

NVIDIA has planned similar AI education initiatives in Kenya and Nigeria to further upskill and enhance African technology ecosystems.

Learn more about the NVIDIA Deep Learning Institute.

Read More

CRISPR-Cas9 guide RNA efficiency prediction with efficiently tuned models in Amazon SageMaker

CRISPR-Cas9 guide RNA efficiency prediction with efficiently tuned models in Amazon SageMaker

The clustered regularly interspaced short palindromic repeat (CRISPR) technology holds the promise to revolutionize gene editing technologies, which is transformative to the way we understand and treat diseases. This technique is based in a natural mechanism found in bacteria that allows a protein coupled to a single guide RNA (gRNA) strand to locate and make cuts in specific sites in the targeted genome. Being able to computationally predict the efficiency and specificity of gRNA is central to the success of gene editing.

Transcribed from DNA sequences, RNA is an important type of biological sequence of ribonucleotides (A, U, G, C), which folds into 3D structure. Benefiting from recent advance in large language models (LLMs), a variety of computational biology tasks can be solved by fine-tuning biological LLMs pre-trained on billions of known biological sequences. The downstream tasks on RNAs are relatively understudied.

In this post, we adopt a pre-trained genomic LLMs for gRNA efficiency prediction. The idea is to treat a computer designed gRNA as a sentence, and fine-tune the LLM to perform sentence-level regression tasks analogous to sentiment analysis. We used Parameter-Efficient Fine-Tuning methods to reduce the number of parameters and GPU usage for this task.

Solution overview

Large language models (LLMs) have gained a lot of interest for their ability to encode syntax and semantics of natural languages. The neural architecture behind LLMs are transformers, which are comprised of attention-based encoder-decoder blocks that generate an internal representation of the data they are trained from (encoder) and are able to generate sequences in the same latent space that resemble the original data (decoder). Due to their success in natural language, recent works have explored the use of LLMs for molecular biology information, which is sequential in nature.

DNABERT is a pre-trained transformer model with non-overlapping human DNA sequence data. The backbone is a BERT architecture made up of 12 encoding layers. The authors of this model report that DNABERT is able to capture a good feature representation of the human genome that enables state-of-the-art performance on downstream tasks like promoter prediction and splice/binding site identification. We decided to use this model as the foundation for our experiments.

Despite the success and popular adoption of LLMs, fine-tuning these models can be difficult because of the number of parameters and computation necessary for it. For this reason, Parameter-Efficient Fine-Tuning (PEFT) methods have been developed. In this post, we use one of these methods, called LoRA (Low-Rank Adaptation). We introduce the method in the following sections.

The following diagram is a representation of the Cas9 DNA target mechanism. The gRNA is the component that helps target the cleavage site.

The goal of this solution is to fine-tune a base DNABERT model to predict activity efficiency from different gRNA candidates. As such, our solution first takes gRNA data and processes it, as described later in this post. Then we use an Amazon SageMaker notebook and the Hugging Face PEFT library to fine-tune the DNABERT model with the processed RNA data. The label we want to predict is the efficiency score as it was calculated in experimental conditions testing with the actual RNA sequences in cell cultures. Those scores describe a balance between being able to edit the genome and not damage DNA that wasn’t targeted.

The following diagram illustrates the workflow of the proposed solution.

Prerequisites

For this solution, you need access to the following:

  • A SageMaker notebook instance (we trained the model on an ml.g4dn.8xlarge instance with a single NVIDIA T4 GPU)
  • transformers-4.34.1
  • peft-0.5.0
  • DNABERT 6

Dataset

For this post, we use the gRNA data released by researchers in a paper about gRNA prediction using deep learning. This dataset contains efficiency scores calculated for different gRNAs. In this section, we describe the process we followed to create the training and evaluation datasets for this task.

To train the model, you need a 30-mer gRNA sequence and efficiency score. A k-mer is a contiguous sequence of k nucleotide bases extracted from a longer DNA or RNA sequence. For example, if you have the DNA sequence “ATCGATCG” and you choose k = 3, then the k-mers within this sequence would be “ATC,” “TCG,” “CGA,” “GAT,” and “ATC.”

Efficiency score

Start with excel file 41467_2021_23576_MOESM4_ESM.xlsx from the CRISPRon paper in the Supplementary Data 1 section. In this file, the authors released the gRNA (20-mer) sequences and corresponding total_indel_eff scores. We specifically used the data from the sheet named spCas9_eff_D10+dox. We use the total_indel_eff column as the efficiency score.

Training and validation data

Given the 20-mers and the crispron scores (same as the total_indel_eff scores) from earlier, complete the following steps to put together the training and validation data:

  1. Convert the sequences in the sheet “TRAP12K microarray oligos” into an .fa (fasta) file.
  2. Run the script get_30mers_from_fa.py (from the CRISPRon GitHub repository) to obtain all possible 23-mers and 30-mers from the sequences obtained from Step 1.
  3. Use the CRISPRspec_CRISPRoff_pipeline.py script (from the CRISPRon GitHub repository) to obtain the binding energy for the 23-mers obtained from Step 2. For more details on how to run this script, check out the code released by the authors of the CRISPRon paper(check the script CRISPRon.sh).
  4. At this point, we have 23-mers along with the corresponding binding energy scores, and 20-mers along with the corresponding CRISPRon scores. Additionally, we have the 30-mers from Step 2.
  5. Use the script prepare_train_dev_data.py (from our released code) to create training and validation splits. Running this script will create two files: train.csv and dev.csv.

The data looks something like the following:

id,rna,crisproff_score,crispron_score
seq2875_p_129,GTCCAGCCACCGAGACCCTGTGTATGGCAC,24.74484099890205,85.96491228
seq2972_p_129,AAAGGCGAAGCAGTATGTTCTAAAAGGAGG,17.216228493196073,94.81132075
. . .
. . .

Model architecture for gRNA encoding

To encode the gRNA sequence, we used the DNABERT encoder. DNABERT was pre-trained on human genomic data, so it’s a good model to encode gRNA sequences. DNABERT tokenizes the nucleotide sequence into overlapping k-mers, and each k-mer serves as a word in the DNABERT model’s vocabulary. The gRNA sequence is broken into a sequence of k-mers, and then each k-mer is replaced by an embedding for the k-mer at the input layer. Otherwise, the architecture of DNABERT is similar to that of BERT. After we encode the gRNA, we use the representation of the [CLS] token as the final encoding of the gRNA sequence. To predict the efficiency score, we use an additional regression layer. The MSE loss will be the training objective. The following is a code snippet of the DNABertForSequenceClassification model:

class DNABertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.config = config
        
        self.bert = BertModel(config)
        classifier_dropout = (
            config.classifier_dropout
            if config.classifier_dropout is not None
            else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = (
            return_dict if return_dict is not None else self.config.use_return_dict
        )

        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        print('bert outputs', outputs)
        pooled_output = outputs[1]
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (
                    labels.dtype == torch.long or labels.dtype == torch.int
                ):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(logits, labels)
        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

Fine-tuning and prompting genomic LLMs

Fine-tuning all the parameters of a model is expensive because the pre-trained model becomes much larger. LoRA is an innovative technique developed to address the challenge of fine-tuning extremely large language models. LoRA offers a solution by suggesting that the pre-trained model’s weights remain fixed while introducing trainable layers (referred to as rank-decomposition matrices) within each transformer block. This approach significantly reduces the number of parameters that need to be trained and lowers the GPU memory requirements, because most model weights don’t require gradient computations.

Therefore, we adopted LoRA as a PEFT method on the DNABERT model. LoRA is implemented in the Hugging Face PEFT library. When using PEFT to train a model with LoRA, the hyperparameters of the low rank adaptation process and the way to wrap base transformers models can be defined as follows:

from peft import LoraConfig

tokenizer = AutoTokenizer.from_pretrained(
        data_training_args.model_path,
        do_lower_case=False
    )
# DNABertForSequenceClassification is a model class for sequence classification task, which is built on top of the DNABert architecture.    
model = DNABertForSequenceClassification.from_pretrained(
        data_training_args.model_path,
        config=config
    )
    
# Define LoRA Config
LORA_R = 16
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
peft_config = LoraConfig(
                     r=LORA_R, # the dimension of the low-rank matrices
                     lora_alpha=LORA_ALPHA, #scaling factor for the weight matrices
                     lora_dropout=LORA_DROPOUT, #dropout probability of the LoRA layers
                     bias="none",
                     task_type = 'SEQ_CLS'
    )
model = get_peft_model(model, peft_config)

Hold-out evaluation performances

We use RMSE, MSE, and MAE as evaluation metrics, and we tested with rank 8 and 16. Furthermore, we implemented a simple fine-tuning method, which is simply adding several dense layers after the DNABERT embeddings. The following table summarizes the results.

Method RMSE MSE MAE
LoRA (rank = 8) 11.933 142.397 7.014
LoRA (rank = 16) 13.039 170.01 7.157
One dense layer 15.435 238.265 9.351
Three dense layer 15.435 238.241 9.505
CRISPRon 11.788 138.971 7.134

When rank=8, we have 296,450 trainable parameters, which is about 33% trainable of the whole. The performance metrics are “rmse”: 11.933, “mse”: 142.397, “mae”: 7.014.

When rank=16, we have 591,362 trainable parameters, which is about 66% trainable of the whole. The performance metrics are “rmse”: 13.039, “mse”: 170.010, “mae”: 7.157. There might have some overfitting issue here under this setting.

We also compare what happens when adding a few dense layers:

  • After adding one dense layer, we have “rmse”: 15.435, “mse”: 238.265, “mae”: 9.351
  • After adding three dense layers, we have “rmse”: 15.435, “mse”: 238.241, “mae”: 9.505

Lastly, we compare with the existing CRISPRon method. CRISPRon is a CNN based deep learning model. The performance metrics are “rmse”: 11.788, “mse”: 138.971, “mae”: 7.134.

As expected, LoRA is doing much better than simply adding a few dense layers. Although the performance of LoRA is a bit worse than CRISPRon, with thorough hyperparameter search, it is likely to outperform CRISPRon.

When using SageMaker notebooks, you have the flexibility to save the work and data produced during the training, turn off the instance, and turn it back on when you’re ready to continue the work, without losing any artifacts. Turning off the instance will keep you from incurring costs on compute you’re not using. We highly recommend only turning it on when you’re actively using it.

Conclusion

In this post, we showed how to use PEFT methods for fine-tuning DNA language models using SageMaker. We focused on predicting efficiency of CRISPR-Cas9 RNA sequences for their impact in current gene-editing technologies. We also provided code that can help you jumpstart your biology applications in AWS.

To learn more about the healthcare and life science space, refer to Run AlphaFold v2.0 on Amazon EC2 or fine-tuning Fine-tune and deploy the ProtBERT model for protein classification using Amazon SageMaker.


About the Authors

Siddharth Varia is an applied scientist in AWS Bedrock. He is broadly interested in natural language processing and has contributed to AWS products such as Amazon Comprehend. Outside of work, he enjoys exploring new places and reading. He got interested in this project after reading the book The Code Breaker.

Yudi Zhang is an Applied Scientist at AWS marketing. Her research interests are in the area of graph neural networks, natural language processing, and statistics.

Erika Pelaez Coyotl is a Sr Applied Scientist in Amazon Bedrock, where she’s currently helping develop the Amazon Titan large language model. Her background is in biomedical science, and she has helped several customers develop ML models in this vertical.

Zichen Wang is a Sr Applied Scientist in AWS AI Research & Education. He is interested in researching graph neural networks and applying AI to accelerate scientific discovery, specifically on molecules and simulations.

Rishita Anubhai is a Principal Applied Scientist in Amazon Bedrock. She has deep expertise in natural language processing and has contributed to AWS projects like Amazon Comprehend, Machine Learning Solutions Lab, and development of Amazon Titan models. She’s keenly interested in using machine learning research, specifically deep learning, to create tangible impact.

Read More

Improve RAG performance using Cohere Rerank

Improve RAG performance using Cohere Rerank

This post is co-written with Pradeep Prabhakaran from Cohere.

Retrieval Augmented Generation (RAG) is a powerful technique that can help enterprises develop generative artificial intelligence (AI) apps that integrate real-time data and enable rich, interactive conversations using proprietary data.

RAG allows these AI applications to tap into external, reliable sources of domain-specific knowledge, enriching the context for the language model as it answers user queries. However, the reliability and accuracy of the responses hinges on finding the right source materials. Therefore, honing the search process in RAG is crucial to boosting the trustworthiness of the generated responses.

RAG systems are important tools for building search and retrieval systems, but they often fall short of expectations due to suboptimal retrieval steps. This can be enhanced using a rerank step to improve search quality.

RAG is an approach that combines information retrieval techniques with natural language processing (NLP) to enhance the performance of text generation or language modeling tasks. This method involves retrieving relevant information from a large corpus of text data and using it to augment the generation process. The key idea is to incorporate external knowledge or context into the model to improve the accuracy, diversity, and relevance of the generated responses.

Workflow of RAG Orchestration

The RAG orchestration generally consists of two steps:

  1. Retrieval – RAG fetches relevant documents from an external data source using the generated search queries. When presented with the search queries, the RAG-based application searches the data source for relevant documents or passages.
  2. Grounded generation – Using the retrieved documents or passages, the generation model creates educated answers with inline citations using the fetched documents.

The following diagram shows the RAG workflow.

Document retrieval in RAG orchestration

One technique for retrieving documents in a RAG orchestration is dense retrieval, which is an approach to information retrieval that aims to understand the semantic meaning and intent behind user queries. Dense retrieval finds the closest documents to a user query in the embedding, as shown in the following screenshot.

The goal of dense retrieval is to map both the user queries and documents (or passages) into a dense vector space. In this space, the similarity between the query and document vectors can be computed using standard distance metrics like cosine similarity or euclidean distance. The documents that match closest to the semantic meaning of the user query based on the calculated distance metrics are then presented back to the user.

The quality of the final responses to search queries is significantly influenced by the relevance of the retrieved documents. While dense retrieval models are very efficient and can scale to large datasets, they struggle with more complex data and questions due to the simplicity of the method. Document vectors contain the meaning of text in a compressed representation—typically 786-1536 dimension vectors. This often results in loss of information because information is compressed into a single vector. When documents are retrieved during a vector search the most relevant information is not always presented at the top of the retrieval.

Boost search accuracy with Cohere Rerank

To address the challenges with accuracy, search engineers have used two-stage retrieval as a means of increasing search quality. In these two-stage systems, a first-stage model (an embedding model or retriever) retrieves a set of candidate documents from a larger dataset. Then, a second-stage model (the reranker) is used to rerank those documents retrieved by the first-stage model.

A reranking model, such as Cohere Rerank, is a type of model that will output a similarity score when given a query and document pair. This score can be used to reorder the documents that are most relevant to the search query. Among the reranking methodologies, the Cohere Rerank model stands out for its ability to significantly enhance search accuracy. The model diverges from traditional embedding models by employing deep learning to evaluate the alignment between each document and the query directly. Cohere Rerank outputs a relevance score by processing the query and document in tandem, which results in a more nuanced document selection process.

In the following example, the application was presented with a query: “When was the transformer paper coauthored by Aidan Gomez published?” The top-k with k = 6 returned the results shown in the image, in which the retrieved result set did contain the most accurate result, although it was at the bottom of the list. With k = 3, the most relevant document would not be included in the retrieved results.

Cohere Rerank aims to reassess and reorder the relevance of the retrieved documents based on additional criteria, such as semantic content, user intent, and contextual relevance, to output a similarity score. This score is then used to reorder the documents by relevance of the query. The following image shows reorder results using Rerank.

By applying Cohere Rerank after the first-stage retrieval, the RAG orchestration can gain the benefits of both approaches. While first-stage retrieval helps to capture relevant items based on proximity matches within the vector space, reranking helps optimize search according to results by guaranteeing contextually relevant results are surfaced to the top. The following diagram demonstrates this improved efficiency.

The latest version of Cohere Rerank, Rerank 3, is purpose-built to enhance enterprise search and RAG systems. Rerank 3 offers state-of-the-art capabilities for enterprise search, including:

  • 4k context length to significantly improve search quality for longer documents
  • Ability to search over multi-aspect and semi-structured data (such as emails, invoices, JSON documents, code, and tables)
  • Multilingual coverage of more than 100 languages
  • Improved latency and lower total cost of ownership (TCO)

The endpoint takes in a query and a list of documents, and it produces an ordered array with each document assigned a relevance score. This provides a powerful semantic boost to the search quality of any keyword or vector search system without requiring any overhaul or replacement.

Developers and businesses can access Rerank on Cohere’s hosted API and on Amazon SageMaker. This post offers a step-by-step walkthrough of consuming Cohere Rerank on Amazon SageMaker.

Solution overview

This solution follows these high-level steps:

  1. Subscribe to the model package
  2. Create an endpoint and perform real-time inference

Prerequisites

For this walkthrough, you must have the following prerequisites:

  1. The cohere-aws notebook.

This is a reference notebook, and it cannot run unless you make changes suggested in the notebook. It contains elements that render correctly in the Jupyter interface, so you need to open it from an Amazon SageMaker notebook instance or in Amazon SageMaker Studio.

  1. An AWS Identity and Access Management (IAM) role with the AmazonSageMakerFullAccess policy attached. To deploy this machine learning (ML) model successfully, choose one of the following options:
    1. If your AWS account does not have a subscription to Cohere Rerank 3 Model – Multilingual, your IAM role needs to have the following three permissions, and you need to have the authority to make AWS Marketplace subscriptions in the AWS account used:
      • aws-marketplace:ViewSubscriptions
      • aws-marketplace:Unsubscribe
      • aws-marketplace:Subscribe
    2. If your AWS account has a subscription to Cohere Rerank 3 Model – Multilingual, you can skip the instructions for subscribing to the model package.

Refrain from using full access in production environments. Security best practice is to opt for the principle of least privilege.

Implement Rerank 3 on Amazon SageMaker

To improve RAG performance using Cohere Rerank, use the instructions in the following sections.

Subscribe to the model package

To subscribe to the model package, follow these steps:

  1. In AWS Marketplace, open the model package listing page Cohere Rerank 3 Model – Multilingual
  2. Choose Continue to Subscribe.
  3. On the Subscribe to this software page, review the End User License Agreement (EULA), pricing, and support terms and choose Accept Offer.
  4. Choose Continue to configuration and then choose a Region. You will see a Product ARN displayed, as shown in the following screenshot. This is the model package Amazon Resource Name (ARN) that you need to specify while creating a deployable model using Boto3. Copy the ARN corresponding to your Region and enter it in the following cell.

The code snippets included in this post are sourced from the aws-cohere notebook. If you encounter any issues with this code, refer to the notebook for the most up-to-date version.

!pip install --upgrade cohere-aws
# if you upgrade the package, you need to restart the kernel

from cohere_aws import Client
import boto3

On the Configure for AWS CloudFormation page shown in the following screenshot, under Product Arn, make a note of the last part of the product ARN to use as the value in the variable cohere_package in the following code.

cohere_package = " cohere-rerank-multilingual-v3--13dba038aab73b11b3f0b17fbdb48ea0"

model_package_map = {

"us-east-1": f"arn:aws:sagemaker:us-east-1:865070037744:model-package/{cohere_package}",

"us-east-2": f"arn:aws:sagemaker:us-east-2:057799348421:model-package/{cohere_package}",

"us-west-1": f"arn:aws:sagemaker:us-west-1:382657785993:model-package/{cohere_package}",

"us-west-2": f"arn:aws:sagemaker:us-west-2:594846645681:model-package/{cohere_package}",

"ca-central-1": f"arn:aws:sagemaker:ca-central-1:470592106596:model-package/{cohere_package}",

"eu-central-1": f"arn:aws:sagemaker:eu-central-1:446921602837:model-package/{cohere_package}",

"eu-west-1": f"arn:aws:sagemaker:eu-west-1:985815980388:model-package/{cohere_package}",

"eu-west-2": f"arn:aws:sagemaker:eu-west-2:856760150666:model-package/{cohere_package}",

"eu-west-3": f"arn:aws:sagemaker:eu-west-3:843114510376:model-package/{cohere_package}",

"eu-north-1": f"arn:aws:sagemaker:eu-north-1:136758871317:model-package/{cohere_package}",

"ap-southeast-1": f"arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/{cohere_package}",

"ap-southeast-2": f"arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/{cohere_package}",

"ap-northeast-2": f"arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/{cohere_package}",

"ap-northeast-1": f"arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/{cohere_package}",

"ap-south-1": f"arn:aws:sagemaker:ap-south-1:077584701553:model-package/{cohere_package}",

"sa-east-1": f"arn:aws:sagemaker:sa-east-1:270155090741:model-package/{cohere_package}",

}

region = boto3.Session().region_name

if region not in model_package_map.keys():

raise Exception(f"Current boto3 session region {region} is not supported.")

model_package_arn = model_package_map[region]

Create an endpoint and perform real-time inference

If you want to understand how real-time inference with Amazon SageMaker works, refer to the Amazon SageMaker Developer Guide.

Create an endpoint

To create an endpoint, use the following code.

co = Client(region_name=region)

co.create_endpoint(arn=model_package_arn, endpoint_name="cohere-rerank-multilingual-v3-0", instance_type="ml.g5.2xlarge", n_instances=1)

# If the endpoint is already created, you just need to connect to it

# co.connect_to_endpoint(endpoint_name="cohere-rerank-multilingual-v3-0”)

After the endpoint is created, you can perform real-time inference.

Create the input payload

To create the input payload, use the following code.

documents = [
    {"Title":"Contraseña incorrecta","Content":"Hola, llevo una hora intentando acceder a mi cuenta y sigue diciendo que mi contraseña es incorrecta. ¿Puede ayudarme, por favor?"},
    {"Title":"Confirmation Email Missed","Content":"Hi, I recently purchased a product from your website but I never received a confirmation email. Can you please look into this for me?"},
    {"Title":"أسئلة حول سياسة الإرجاع","Content":"مرحبًا، لدي سؤال حول سياسة إرجاع هذا المنتج. لقد اشتريته قبل بضعة أسابيع وهو معيب"},
    {"Title":"Customer Support is Busy","Content":"Good morning, I have been trying to reach your customer support team for the past week but I keep getting a busy signal. Can you please help me?"},
    {"Title":"Falschen Artikel erhalten","Content":"Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken."},
    {"Title":"Customer Service is Unavailable","Content":"Hello, I have been trying to reach your customer support team for the past hour but I keep getting a busy signal. Can you please help me?"},
    {"Title":"Return Policy for Defective Product","Content":"Hi, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."},
    {"Title":"收到错误物品","Content":"早上好,关于我最近的订单,我有一个问题。我收到了错误的商品,需要退货。"},
    {"Title":"Return Defective Product","Content":"Hello, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."}
]

 

Perform real-time inference

To perform real-time inference, use the following code.

 

response = co.rerank(documents=documents, query='What emails have been about returning items?', rank_fields=["Title","Content"], top_n=5)

Visualize output

To visualize output, use the following code.

print(f'Documents: {response}')

The following screenshot shows the output response.

Cleanup

To avoid any recurring charges, use the following steps to clean up the resources created in this walkthrough.

Delete the model

Now that you have successfully performed a real-time inference, you do not need the endpoint anymore. You can terminate the endpoint to avoid being charged.

co.delete_endpoint()
co.close()

Unsubscribe to the listing (optional)

If you want to unsubscribe to the model package, follow these steps. Before you cancel the subscription, make sure that you don’t have a deployable model created from the model package or using the algorithm. You can find this information by looking at the container name associated with the model.

Steps to unsubscribe from the product from AWS Marketplace:

  1. On the Your Software subscriptions page, choose the Machine Learning tab
  2. Locate the listing that you want to cancel the subscription for, and then choose Cancel Subscription

Summary

RAG is a capable technique for developing AI applications that integrate real-time data and enable interactive conversations using proprietary information. RAG enhances AI responses by tapping into external, domain-specific knowledge sources, but its effectiveness depends on finding the right source materials. This post focuses on improving search efficiency and accuracy in RAG systems using Cohere Rerank. RAG orchestration typically involves two steps: retrieval of relevant documents and generation of answers. While dense retrieval is efficient for large datasets, it can struggle with complex data and questions due to information compression. Cohere Rerank uses deep learning to evaluate the alignment between documents and queries, outputting a relevance score that enables more nuanced document selection.

Customers can find Cohere Rerank 3 and Cohere Rerank 3 Nimble on Amazon Sagemaker Jumpstart.


About the Authors

Shashi Raina is a Senior Partner Solutions Architect at Amazon Web Services (AWS), where he specializes in supporting generative AI (GenAI) startups. With close to 6 years of experience at AWS, Shashi has developed deep expertise across a range of domains, including DevOps, analytics, and generative AI.

Pradeep Prabhakaran is a Senior Manager – Solutions Architecture at Cohere. In his current role at Cohere, Pradeep acts as a trusted technical advisor to customers and partners, providing guidance and strategies to help them realize the full potential of Cohere’s cutting-edge Generative AI platform.

Read More