Enable or disable ACL crawling safely in Amazon Q Business

Amazon Q Business recently added support for administrators to modify the default access control list (ACL) crawling feature for data source connectors.

Amazon Q Business is a fully managed, AI-powered assistant with enterprise-grade security and privacy features. It includes over 40 data source connectors that crawl and index documents. By default, Amazon Q Business indexes ACL information attached to documents along with the documents themselves and uses this to filter chat responses based on the user’s document access. With this new feature, you can enable or disable ACL crawling as required by your business use case.

This post introduces the new ACL toggle feature for Amazon Q Business, which you can use to enable or disable ACL crawling. We’ll explore use cases for disabling ACLs and discuss how to safely enable or disable ACL crawling.

Overview of access control list crawling

Amazon Q Business data source connectors help crawl various data sources to collect and index content in Amazon Q Business for fast discovery and retrieval when answering user queries. These data sources often contain documents with different classifications such as public, internal public, private, and confidential. To provide fine-grained control over access rights, you can attach ACLs to documents, allowing you to specify different levels of access for various users or groups. To verify that Amazon Q Business respects access control policies and that users only receive responses for content they’re authorized to access, the data source connectors automatically crawl for access permissions associated with the content, user identifiers, and groups.

The preceding figure illustrates the Amazon Q Business data source crawler with ACL crawling enabled. As the connector retrieves content from the data source, it examines the associated ACL and compiles a list of users and groups with read permissions for each document. The connector also collects user identifiers, which are stored in the Amazon Q Business user store for quick matching during query execution. Both the ACL and content are optimized and stored in the Amazon Q Business index storage, enabling secure and efficient retrieval when answering user queries. For more information on the user store, see Understanding Amazon Q Business User Store.
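To make the query-time matching described above concrete, here is a minimal conceptual sketch in Python. It is not how Amazon Q Business is implemented internally; the IndexedDocument class, the USER_STORE dictionary, and the sample identities are all hypothetical and exist only to illustrate the idea of filtering candidate documents by the ACL that was crawled with them.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedDocument:
    """A document as it might look after crawling: content plus its read ACL."""
    doc_id: str
    text: str
    allowed_users: set[str] = field(default_factory=set)
    allowed_groups: set[str] = field(default_factory=set)


# Hypothetical user store: maps a user identifier to its group memberships.
USER_STORE = {
    "ana@example.com": {"hr"},
    "raj@example.com": {"engineering"},
}


def can_read(user_id: str, doc: IndexedDocument) -> bool:
    """Return True if the user, or one of the user's groups, is on the ACL."""
    groups = USER_STORE.get(user_id, set())
    return user_id in doc.allowed_users or bool(groups & doc.allowed_groups)


def filter_results(user_id: str, candidates: list[IndexedDocument]) -> list[str]:
    """Keep only the candidate documents the user is allowed to read."""
    return [doc.doc_id for doc in candidates if can_read(user_id, doc)]


index = [
    IndexedDocument("leave-policy", "...", allowed_groups={"hr", "engineering"}),
    IndexedDocument("salary-bands", "...", allowed_groups={"hr"}),
    IndexedDocument("design-doc", "...", allowed_users={"raj@example.com"}),
]

print(filter_results("ana@example.com", index))  # ['leave-policy', 'salary-bands']
print(filter_results("raj@example.com", index))  # ['leave-policy', 'design-doc']
```

In this sketch, a user who is missing from the user store and from every ACL gets no results at all. A document indexed without any ACL would need a separate rule, and treating such content as visible to every application user is the behavior the rest of this post asks you to handle with care.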

When to disable ACL crawling?

ACL crawling builds a security-aware index that respects access control policies in the primary data source. This process helps maintain data privacy and access control required for regulatory compliance, making sure that sensitive information isn’t inadvertently exposed through user query results. It provides a scalable mechanism to handle large amounts of content while maintaining consistency between the actual access controls on the data and what’s discoverable through search. Because of these advantages, ACL crawling is strongly recommended for all data sources. However, there are some circumstances when you might need to disable it. The following are some reasons why you might disable ACL crawling.

Internally public content

Organizations often designate certain data sources as internally public, including HR policies, IT knowledge bases, and wiki pages. For instance, a company might allocate an entire Microsoft SharePoint site for policies accessible to all employees, classifying it as internal-public. In such cases, crawling ACLs for permissions that include all employees can be costly and create unnecessary overhead. Turning off ACL crawling might be advantageous in these scenarios.

Data source contains irreconcilable identities

Amazon Q Business requires all users to authenticate with an enterprise-approved identity provider (IdP). After successful authentication, Amazon Q Business uses the IdP-provided user identifier to match against the user identifier fetched from the data source during ACL crawling. This process validates user access to content before retrieving it for query responses.

However, because of legacy issues such as mergers and acquisitions, data source configuration limitations, or other constraints, the primary user identifier from the IdP might differ from the one in the data source. This discrepancy can prevent Amazon Q Business from retrieving relevant content from the index and answering user queries effectively.

In such cases, it might be necessary to disable ACL crawling and use alternative options. These include implementing attribute filters or building dedicated restricted applications with access limited to specific audiences and content. For more information on attribute filters, see Filtering chat responses using document attributes.

Use case-driven targeted deployments

As a fully managed service, Amazon Q Business can be quickly deployed in multiple instances for scoped down targeted use cases. Examples include an HR bot in Slack or an AI assistant for customer support agents in a contact center. Because these AI assistants might be deployed for a limited audience, and the indexed content might be generally available to all users with application access, ACL crawling can be turned off.

Note of caution

Amazon Q Business cannot enforce access controls if ACL crawling is disabled. When ACL crawling is disabled for a data source, indexed content in that source will be considered accessible to users with access to the Amazon Q Business application. Therefore, disabling ACL crawling should be done with caution and due diligence. The following are some recommended best practices:

  • Notify data source content owners and administrators of your intent to disable ACL crawling and obtain their approval beforehand.
  • If applicable, consider implementing alternative options such as attribute filtering to restrict content retrieval or deploying a scoped-down, use-case-driven deployment to a limited audience.
  • Maintain a decision document that clearly articulates the reasons for disabling ACL crawling, the scope of affected content, and precautions taken to prevent indexing of sensitive information.

Note: As a precaution, you cannot disable ACL crawling for an existing Amazon Q Business data source that already has ACL crawling enabled. To disable ACL crawling, you must delete the data source and recreate it. You can only disable ACL crawling during the data source creation process, and this requires an account administrator to grant permission for disabling ACL crawling when configuring the data source.

Procedures for configuring ACL crawling

Amazon Q Business ACL crawling helps protect your data. Amazon Q Business provides safeguards to help administrators and developers avoid accidentally disabling ACL crawling. In this section, we cover how you can allow or deny the ACL crawling disable feature, explore procedures to enable or disable ACL crawling, explain how to monitor logs for ACL crawling configuration changes, and troubleshoot common issues.

Personas for configuring ACL crawling

ACL crawling configuration typically involves multiple roles, depending on your organizational structure. To maximize safeguards, it’s recommended that these roles are filled by different individuals. For faster deployments, identify the necessary personnel within your organization before starting the project and ensure they collaborate to complete the configuration. Here are the common roles needed for ACL crawling configuration:

  1. AWS account administrator – An AWS account administrator is a user with full access to AWS services and the ability to manage IAM resources and permissions in the account. They can create and manage organizations, enabling centralized management of multiple AWS accounts.
  2. Amazon Q Business administrator – An Amazon Q Business administrator is typically a user or role responsible for managing and configuring the Amazon Q Business service. Their duties include creating and optimizing Amazon Q Business indexes, setting up guardrails, and tuning relevance. They also set up and maintain connections to various data sources that Amazon Q Business will index, such as Amazon Simple Storage Service (Amazon S3) buckets, SharePoint, Salesforce, and Confluence.

Prerequisites for ACL crawling

Process to disallow the option to disable ACL crawling

By default, the option to disable ACL crawling is enabled for an account. AWS account administrators can disallow this feature by setting up an account-level policy. It’s recommended to configure an explicit deny for production accounts by default. The following shows the associated actions in relation to the personas involved in the configuration process.

Administrators can attach the IAM action qbusiness:DisableAclOnDataSource to the Amazon Q Business administrator user or role policy to deny or allow the option to disable ACL crawling. The example IAM policy code snippet that follows demonstrates how to set up an explicit deny.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": [
                "qbusiness:DisableAclOnDataSource"
            ],
            "Resource": ["*"]
        }
    ]
}

Note that even if the option to disable ACL crawling is denied, the user interface might not gray out this option. However, if you attempt to create a data source with this option disabled, it will fail the validation check, and Amazon Q Business will not create the data source.
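The reason an explicit deny works as a safeguard is that, in IAM policy evaluation, an explicit deny overrides any allow. The following Python sketch illustrates that rule with a deliberately simplified evaluator. The is_allowed function and the ALLOW_QBUSINESS policy are hypothetical illustrations, and the evaluator ignores resources, conditions, and other elements that real IAM evaluation considers.

```python
import fnmatch
import json

# The explicit-deny policy from this section.
DENY_DISABLE_ACL = json.loads("""
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": ["qbusiness:DisableAclOnDataSource"],
            "Resource": ["*"]
        }
    ]
}
""")

# Hypothetical broad allow that an Amazon Q Business administrator role might hold.
ALLOW_QBUSINESS = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["qbusiness:*"], "Resource": ["*"]}
    ],
}


def is_allowed(action: str, policies: list[dict]) -> bool:
    """Simplified evaluation: explicit deny wins, then allow, else implicit deny."""
    allowed = False
    for policy in policies:
        for statement in policy["Statement"]:
            patterns = statement["Action"]
            if isinstance(patterns, str):
                patterns = [patterns]
            # Action names are matched case-insensitively, with * as a wildcard.
            matches = any(
                fnmatch.fnmatchcase(action.lower(), pattern.lower())
                for pattern in patterns
            )
            if not matches:
                continue
            if statement["Effect"] == "Deny":
                return False  # an explicit deny cannot be overridden
            allowed = True
    return allowed


action = "qbusiness:DisableAclOnDataSource"
print(is_allowed(action, [ALLOW_QBUSINESS]))                    # True
print(is_allowed(action, [ALLOW_QBUSINESS, DENY_DISABLE_ACL]))  # False
```

Under this model, granting a temporary exception means removing or narrowing the deny statement, because adding another allow has no effect while the deny is in place.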

Process to disable ACL crawling for a data source connector

Before setting up a data source connector with ACL crawling disabled in your Amazon Q Business application deployment, make sure that you have no sensitive content in the data source or have implemented controls to help prevent accidental content exposure. Verify that the data source connector supports the option to disable ACL crawling. Notify information custodians, content owners, and data source administrators of your intent to disable ACL crawling and obtain their documented approvals, if necessary. If your account administrator has explicitly denied the option to disable ACL crawling, request temporary permission. After you have secured all approvals and exceptions, create a new data source with ACL crawling disabled and sync the data. With ACL crawling disabled, Amazon Q Business users will be able to discover knowledge and obtain answers from the indexed documents through this connector. Notify the account administrator to revert the account policy back to explicitly denying the disable ACL crawling option. The process and interaction between different roles are shown in the following chart.

The following is an overview of the procedure to create a data source with ACL crawling disabled using the AWS Management Console:

  1. Navigate to the Amazon Q Business console.
  2. Select the Amazon Q Business application that you want to add a data source connector to.
  3. Choose Add data source in the Data sources section and select the desired connector.
  4. Update the connector configuration information. See Connecting Amazon Q Business data sources for configuration details.
  5. In the Authorization section, choose Disable ACLs and check the acknowledgment to accept the risks of disabling ACL crawling.
  6. Complete the remaining connector configuration and choose Save.
  7. Sync the data source.

Note: You cannot disable ACL crawling for an existing data source connector that was created with ACL crawling enabled. You must create a new data source connector instance with ACL disabled and delete the older instance that has ACL crawling enabled.

Process to enable ACL crawling for a data source connector

Creating a data source connector with ACL crawling enabled is recommended and doesn’t require additional allow listing from AWS account administrators. To enable ACL crawling, you follow steps similar to disabling ACLs as described in the previous section. When configuring the data source connector using the console, choose Enable ACLs in the Authorization section to create a connector with ACL crawling enabled. You can also enable ACL crawling at any time for an existing data source connector that was created with this option disabled. Sync the data source connector for the ACL enforcement to take effect. Amazon Q Business users can only query and obtain answers from documents to which they have access in the original data source.

It’s important to verify that the data source administrator has set up the required permissions properly, so that the crawler has permission to crawl for ACLs in the data source before you enable ACL crawling. You can find the required permissions in the prerequisite section of the connector in Connecting Amazon Q Business data sources. The following shows the process for setting up a data source connector with ACL crawling enabled.

Logging and monitoring the ACL crawling configuration

Amazon Q Business uses AWS CloudTrail for logging API calls related to ACL crawling configuration. You can monitor the CloudTrail log for CreateDataSource and UpdateDataSource API calls to identify ACL crawling-related changes made to data source configuration. For a complete list of Amazon Q Business APIs that are logged to CloudTrail, see Logging Amazon Q Business API calls using AWS CloudTrail.

Administrators can configure Amazon CloudWatch alarms to generate automated alert notifications if ACL crawling is disabled for a data source connector, allowing them to initiate corrective action. For step-by-step instructions on setting up CloudWatch alarms based on CloudTrail events, see How do I use CloudWatch alarms to monitor CloudTrail events.

The example CloudWatch alarm code snippet that follows shows the filter pattern for identifying events related to disabling ACL crawling in a data source connector.

{
    ($.eventSource = qbusiness.amazonaws.com)
    && (
        ($.eventName = CreateDataSource)
        || ($.eventName = UpdateDataSource)
    )
    && ($.requestParameters.disableAclCrawl = true) 
}
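To check which events a pattern like this would catch before wiring up an alarm, it can help to express the same logic as a small Python predicate and run it against sample records. This is a sketch only: the disables_acl_crawling function and the sample events are hypothetical, and the requestParameters.disableAclCrawl field name is taken from the filter pattern above rather than verified against actual CloudTrail records.

```python
def disables_acl_crawling(event: dict) -> bool:
    """Mirror the filter pattern above for a single CloudTrail event record."""
    # requestParameters can be absent or null, so fall back to an empty dict.
    request = event.get("requestParameters") or {}
    return (
        event.get("eventSource") == "qbusiness.amazonaws.com"
        and event.get("eventName") in {"CreateDataSource", "UpdateDataSource"}
        and request.get("disableAclCrawl") is True
    )


# Hypothetical, heavily trimmed sample records.
sample_events = [
    {
        "eventSource": "qbusiness.amazonaws.com",
        "eventName": "CreateDataSource",
        "requestParameters": {"displayName": "hr-wiki", "disableAclCrawl": True},
    },
    {
        "eventSource": "qbusiness.amazonaws.com",
        "eventName": "CreateDataSource",
        "requestParameters": {"displayName": "finance-share", "disableAclCrawl": False},
    },
    {
        "eventSource": "qbusiness.amazonaws.com",
        "eventName": "DeleteDataSource",
        "requestParameters": None,
    },
]

flagged = [e["requestParameters"]["displayName"] for e in sample_events if disables_acl_crawling(e)]
print(flagged)  # ['hr-wiki']
```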

Tips for troubleshooting

When configuring Amazon Q Business data source connectors, you might occasionally encounter issues. The following are some common errors and their possible resolutions.

Not authorized to disable ACL crawling

When creating a new data source connector with ACL crawling disabled, you might see an error message stating not authorized to perform: qbusiness:DisableAclOnDataSource as shown in the following image.

This error indicates that your administrator has explicitly denied the option to disable ACL crawling for your AWS account. Contact your administrator to allow-list this action for your account. For more details, see the Process to disable ACL crawling for a data source connector section earlier in this post.

Data source connection errors

Data source connectors might also fail to connect to your data source or crawl data. In such cases, verify that Amazon Q Business can reach the data source through the public internet or through a VPC private network. See Connecting Amazon Q Business data sources to make sure that your data source authentication has the permissions needed to crawl content and ACLs, if enabled.

Identity and ACL mismatch errors

Finally, after successfully syncing data with ACL crawling enabled, some users might still be unable to get answers to queries, even though the relevant documents were indexed. This issue commonly occurs when the user lacks access to the indexed content in the original data source, or when the user identity obtained from the data source doesn’t match the sign-in identity. To troubleshoot such ACL mismatch issues, examine the data source sync report. For more information, see Introducing document-level sync reports: Enhanced data sync visibility in Amazon Q Business.
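Before digging into individual documents, a quick comparison of the two identifier sets can show whether a mismatch is systematic. The Python sketch below assumes you have already exported the sign-in identifiers from your IdP and the user identifiers found in the data source ACLs; the compare_identities function and the sample addresses are hypothetical, and the comparison is an exact string match, which may be stricter or looser than the matching Amazon Q Business performs.

```python
def compare_identities(idp_users: set[str], acl_users: set[str]) -> dict[str, set[str]]:
    """Compare sign-in identifiers with identifiers crawled from the data source."""
    return {
        "matched": idp_users & acl_users,
        # Users who sign in but never appear on an ACL under that identifier.
        "idp_only": idp_users - acl_users,
        # ACL entries that no sign-in identifier maps to.
        "acl_only": acl_users - idp_users,
    }


idp_users = {"ana@example.com", "raj@example.com"}
acl_users = {"ana@example.com", "rravi@legacy.example.com"}

report = compare_identities(idp_users, acl_users)
for bucket, users in report.items():
    print(bucket, sorted(users))
```

A large idp_only bucket paired with a similarly sized acl_only bucket often points to the irreconcilable-identity situation described earlier in this post, such as a legacy domain left over from an acquisition.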

Key considerations and recommendations

Given the impact that disabling ACL crawling can have on content security, consider these restrictions and best practices when disabling ACL crawling in Amazon Q Business data source connectors:

  • ACL crawling enablement is a one-way control mechanism. After it’s enabled, you cannot disable it. This helps prevent accidentally disabling ACL crawling in production environments.
  • Keep ACL crawling enabled by default and disable it only for the subset of data source connectors that require it.
  • If necessary, consider splitting the indexing of a data source by setting up multiple data source connectors and limiting ACL crawling disablement to a smaller content segment. Use the document Inclusion and Exclusion feature of data source connectors to define the indexing scope.
  • When ACL crawling is disabled because of irreconcilable identities, consider alternative options. These include implementing attribute filters, restricting access to the Amazon Q Business application, and setting up guardrails.
  • As a security best practice, AWS Organizations and account administrators should add a service control policy to explicitly deny the qbusiness:DisableAclOnDataSource permission for all accounts. Grant this permission only when requested by an Amazon Q Business administrator. After configuring a data source connector with ACL crawling disabled, revert to an explicit deny. Use a ticketing system to maintain a record of exception approvals. For more information, see <link>.
  • Currently, disabling ACL crawling is available for limited connectors, including ServiceNow, Confluence, SharePoint, Jira, Google Drive, OneDrive, Salesforce, Zendesk, GitHub, MS Teams, and Slack. For the latest list of connectors that support disabling ACL crawling, see Connecting Amazon Q Business data sources.

Clean up

To avoid incurring additional charges, make sure you delete any resources created in this post.

  1. To delete any data source created in Amazon Q Business, follow the instructions in Deleting an Amazon Q Business data source connector.
  2. To delete any Amazon Q Business application created, follow the instructions in Deleting an application.

Conclusion

Amazon Q Business data source connector ACL crawling is an essential feature that helps organizations build, manage, and scale secure AI assistants. It plays a crucial role in enforcing regulatory and compliance policies and protecting sensitive content. With the introduction of a self-service feature to disable ACL crawling, Amazon Q Business now provides you more autonomy to choose deployment options that suit your organization’s business needs. To start building secure AI assistants with Amazon Q Business, explore the Getting started guide.


About the Authors

Rajesh Kumar Ravi, a Senior Solutions Architect at Amazon Web Services, specializes in building generative AI solutions using Amazon Q Business, Amazon Bedrock, and Amazon Kendra. He helps businesses worldwide implement these technologies to enhance efficiency, innovation, and competitiveness. An accomplished technology leader, Rajesh has experience developing innovative AI products, nurturing the builder community, and contributing to new ideas. Outside of work, he enjoys walking and short hiking trips.

Meenakshisundaram Thandavarayan works for AWS as an AI/ML Specialist. He has a passion to design, create, and promote human-centered data and analytics experiences. Meena focuses on developing sustainable systems that deliver measurable, competitive advantages for strategic customers of AWS. Meena is a connector and design thinker and strives to drive business to new ways of working through innovation, incubation, and democratization.

Amit Choudhary is a Product Manager for Amazon Q Business connectors. He loves to build products that make it easy for customers to use privacy-preserving technologies (PETs) such as differential privacy.

Keerthi Kumar Kallur is a Software Development Engineer at AWS. He is part of the Amazon Q Business team and worked on various features with customers. In his spare time, he likes to do outdoor activities such as hiking and sports such as volleyball.

SK Telecom improves telco-specific Q&A by fine-tuning Anthropic’s Claude models in Amazon Bedrock

SK Telecom (SKT), South Korea’s leading telecommunications company serving 30 million customers, is at the forefront of AI innovation. In line with its AI Pyramid Strategy, which aims to unlock AI’s potential for anyone, anywhere, anytime, SKT has collaborated with the AWS Generative AI Innovation Center (GenAIIC) Custom Model Program to explore domain-trained models using Amazon Bedrock for telco-specific use cases.

This collaboration aligns with SKT’s vision of using AI expertise and strategic partnerships to develop innovative AI-based products and services. One such initiative focused on developing a custom solution for grounded question answering (Q&A) based on reference documents.

Retrieval Augmented Generation (RAG) is a popular technique for Q&A tasks, offering improved factual accuracy and knowledge grounding. However, RAG can generate responses that don’t match the preferred tone, style, and manner for telco use cases, and it can retrieve irrelevant documents, potentially leading to inaccurate responses. To address this, SKT and AWS GenAIIC aimed to use model customization to improve Anthropic Claude models on Amazon Bedrock in three key areas:

  • Providing concise and informative answers
  • Correctly referencing links from retrieved documents
  • Answering in a tone and style consistent with SKT and similar to ground truth answers

Additionally, the team explored boosting smaller model performance using synthetic data generated by bigger large language models (LLMs) for knowledge distillation and scenarios with limited labeled training data.

Amazon Bedrock is a fully managed service that offers a variety of LLMs and foundation models (FMs) along with capabilities such as Amazon Bedrock Knowledge Bases, Amazon Bedrock Agents, and Amazon Bedrock Guardrails that can expedite many generative AI use cases. Amazon Bedrock is the only fully managed service that provides you with the ability to fine-tune Claude models. Amazon Bedrock offers an intuitive and secure way of fine-tuning Anthropic’s Claude models and more. The fine-tuned Claude model can be deployed using Amazon Bedrock and can use the capabilities of Amazon Bedrock seamlessly, for example, Amazon Bedrock Knowledge Bases for the telco domain-specific RAG or Amazon Bedrock Agents for the agentic usage.

In this post, we share how SKT customizes Anthropic Claude models for telco-specific Q&A regarding technical telecommunication documents of SKT using Amazon Bedrock.

Solution overview

The team explored combinations of prompt optimization, customization (fine-tuning), and data augmentation with synthetic data. This multifaceted approach aimed to maximize the benefits of each technique for the grounded Q&A generation task.

In the following sections, we explore these methods in more detail.

Anthropic’s Claude customization with prompt optimization

Fine-tuning, which is available through Amazon Bedrock for various FMs, including Anthropic’s Claude, allows adaptation of pre-trained language models for specific use cases. It’s particularly effective for tailoring response style and format adherence.

The team first optimized the system prompt, implementing standardized guidelines for answer formatting and document citation based on Anthropic model prompting best practices. Key focus areas included:

  • Clear presentation of system commands
  • Consistent use of code block formatting
  • Context-based tailored responses

This prompt engineering, combined with fine-tuning, yielded substantial improvements:

  • Over 50% increase in ROUGE-3 score
  • Over 25% improvement in ROUGE-L score
  • Over 4% increase in embedding similarity score
  • Significant progress in accurate reference citation

The iterative enhancement process demonstrated cumulative benefits, with prompt updates alone showing 35–40 percent improvements in key metrics, and the final customized model achieving 50–60 percent gains in some metrics.

This progression clearly illustrates the cumulative benefits of model customization through RAG, prompt engineering, and fine-tuning, resulting in a model that significantly outperformed both the baseline and the prompt-updated versions in terms of ROUGE scores and citation accuracy. ROUGE score measures the similarity between ground truths and generated results by computing N-gram word overlaps. The following table summarizes these improvements.

Relative improvement over baseline:

| LLM | Prompt update | Fine-tuning | ROUGE-3 | ROUGE-L | Citation accuracy |
| Anthropic’s Claude 3 Sonnet | | | baseline | baseline | baseline |
| Anthropic’s Claude 3 Sonnet | ✅ | | +38.30% | +13.4% | +52.94% |
| Anthropic’s Claude 3 Sonnet | ✅ | ✅ | +58.1% | +26.8% | +70.59% |
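As a rough illustration of the N-gram overlap that ROUGE-N measures, the following Python sketch computes precision, recall, and F1 from clipped N-gram matches. It is a minimal version under stated assumptions: tokens are split on whitespace and lowercased, the rouge_n function and the example sentences are hypothetical, and the post does not say which ROUGE implementation or tokenizer the team used. ROUGE-L, which is based on the longest common subsequence rather than fixed-length N-grams, is not shown.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference: str, candidate: str, n: int) -> dict[str, float]:
    """ROUGE-N precision, recall, and F1 from clipped n-gram overlap."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "the plan includes unlimited data and roaming"
candidate = "the plan includes unlimited data"

scores = rouge_n(reference, candidate, n=3)
print(scores["precision"], scores["recall"])  # 1.0 0.6
```

Here the candidate's three trigrams all appear among the reference's five trigrams, so precision is 3/3 and recall is 3/5. Longer N-grams such as trigrams reward answers that reproduce the phrasing of the ground truth, not just its vocabulary.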

Synthetic data for fine-tuning

To address the challenge of limited high-quality labeled training data, the team explored synthetic data generation techniques. This approach also facilitates knowledge distillation from larger LLMs to smaller, more targeted models, offering benefits such as lower latency and cost.

The team conducted controlled experiments using:

  • A baseline set of 500 ground truth samples
  • An augmented set with 500 original and 1,500 synthetic samples
  • A larger original set of 2,000 samples

Synthetic data was generated using Anthropic’s Claude 3 Sonnet, creating new question-answer pairs over the same retrieved documents used in ground truth examples.

The results were evaluated using both LLM-based comparison and human preference evaluation. Human evaluators blindly ranked model outputs, with scores assigned based on preference (Best: 4, Second: 3, Third: 2, Worst: 1). The following table shows the results of the human preference evaluation scores.

| Rank | Model | Cumulative score (best possible: 160) |
| 1 | Fine-tuned with 2,000 original samples | 114 |
| 2 | Fine-tuned with 500 original and 1,500 synthetic samples | 112 |
| 3 | Fine-tuned with 500 original samples | 85 |
| 4 | No fine-tuning (baseline) | 84 |
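The scoring scheme described above can be sketched in a few lines of Python. The cumulative_scores function, the model labels, and the three sample rankings are hypothetical and are not the study's data. As a side note, a best possible score of 160 at 4 points per first place is consistent with 40 ranked comparisons (160 / 4 = 40), although the post does not state that number directly.

```python
from collections import defaultdict

# Points by position in a ranking: best, second, third, worst.
POINTS_BY_POSITION = [4, 3, 2, 1]


def cumulative_scores(rankings: list[list[str]]) -> dict[str, int]:
    """Sum preference points per model; each ranking lists models best to worst."""
    totals: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        for model, points in zip(ranking, POINTS_BY_POSITION):
            totals[model] += points
    return dict(totals)


# Three made-up blind rankings from evaluators.
rankings = [
    ["ft_2000", "ft_500_plus_synthetic", "ft_500", "baseline"],
    ["ft_500_plus_synthetic", "ft_2000", "baseline", "ft_500"],
    ["ft_2000", "ft_500_plus_synthetic", "baseline", "ft_500"],
]

totals = cumulative_scores(rankings)
print(totals["ft_2000"], totals["ft_500_plus_synthetic"])  # 11 10
print(len(rankings) * POINTS_BY_POSITION[0])               # best possible: 12
```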

Some key findings include:

  • Small training sets (500 samples) showed minimal improvement over baseline
  • Larger training sets (2,000 samples) scored considerably higher
  • Synthetically augmented data performed similarly to equivalent-sized original data

Although having a large volume of domain-specific training data is always ideal, many businesses have limited available datasets. In such scenarios, synthetic data can play a crucial role in place of original data. This demonstrates the potential of synthetic data for model customization.

Conclusion

SK Telecom’s collaboration with AWS GenAIIC showcases the company’s commitment to developing innovative AI solutions for telco challenges. By using Amazon Bedrock to customize Anthropic’s Claude models, SKT has achieved significant performance improvements for telco-specific, Korean language use cases without the need to build models from scratch. The proof of concept demonstrated the following gains:

  • ~58% increase in ROUGE-3 score
  • ~27% increase in ROUGE-L score
  • Substantial improvement in returning correct reference links

This approach, combined with synthetic data generation techniques, aligns with SKT’s AI Pyramid Strategy, enabling faster testing and development of new approaches. As SKT continues to focus on key areas such as personal AI assistants, AI healthcare, and AI data centers, this collaboration with AWS represents a significant step in their AI evolution and long-term competitiveness in the global AI landscape.

For those interested in working with AWS on similar projects, visit Generative AI Innovation Center.


About the Authors

Sungmin Hong is a Senior Applied Scientist at AWS Generative AI Innovation Center where he helps expedite a variety of use cases for AWS customers. Before joining Amazon, Sungmin was a postdoctoral research fellow at Harvard Medical School. He holds a Ph.D. in Computer Science from New York University. Outside of work, Sungmin enjoys hiking, reading and cooking.

Sujeong Cha is a Deep Learning Architect at the AWS Generative AI Innovation Center, where she specializes in model customization and optimization. She has extensive hands-on experience in solving customers’ business use cases by utilizing generative AI as well as traditional AI/ML solutions. Sujeong holds an M.S. degree in Data Science from New York University.

Arijit Ghosh Chowdhury is a Scientist with the AWS Generative AI Innovation Center, where he works on model customization and optimization. In his role, he works on applied research in fine-tuning and model evaluations to enable GenAI for various industries. He has a Master’s degree in Computer Science from the University of Illinois at Urbana Champaign, where his research focused on question answering, search and domain adaptation.

Yiyue Qian is an Applied Scientist II at the AWS Generative AI Innovation Center, where she supports providing generative AI solutions to AWS customers. In this role, she collaborates with a team of experts to develop innovative AI-driven models for AWS customers across various industries. Yiyue holds a Ph.D. in Computer Science from the University of Notre Dame, where her research focused on advanced machine learning and deep learning techniques.

Wei-Chih Chen is a Machine Learning Engineer at the AWS Generative AI Innovation Center, where he works on model customization and optimization for LLMs. He also builds tools to help his team tackle various aspects of the LLM development life cycle (including fine-tuning, benchmarking, and load testing) that accelerate the adoption of diverse use cases for AWS customers. He holds an M.S. degree in Computer Science from UC Davis.

Hannah Marlowe is a Senior Manager of Model Customization at the AWS Generative AI Innovation Center. Her team specializes in helping customers develop differentiating Generative AI solutions using their unique and proprietary data to achieve key business outcomes. She holds a Ph.D. in Physics from the University of Iowa, with a focus on astronomical X-ray analysis and instrumentation development. Outside of work, she can be found hiking, mountain biking, and skiing around the mountains in Colorado.

Seunghyeon Jeong (Steve) is a team leader of the Platform Application team at SKT. He is responsible for commercializing the Global Intelligence Platform (GIP), which provides AI models and tools. His team is expanding the delivery of models and features to make it easier for internal teams to apply AI, contributing to SKT’s AI Transformation. Before entering the AI space, he was a Product Manager, developing and operating various mobile services such as mobile wallet, fashion streaming, and unified login services for the US and Korea.

Sunwoo Lee (Lois) is the team leader of the Data Construction and Evaluation Team within SK Telecom’s Global AI Tech division. She oversees the design and construction of training data for language models, the model performance evaluation process, and its application to services. Her career has focused on NLP within IT, which is a great fit with her background in Linguistics and Korean language education. Alongside her world-class team, she continues to explore and solve fascinating problems such as how to optimize the design of data for language model training, which tasks and methods to implement for validating language model performance, and the best design of AI-human conversations.

Eric Davis is the vice president of the AI Tech Collaboration Group at SKT. Eric oversees tech collaborations with worldwide tech partners to customize large language models (LLMs) for the telecommunications domain. His teams are responsible for designing and building the datasets to tune LLMs, as well as benchmarking LLMs in general and for the telecommunications domain. Eric holds a Master of Science degree in Computer Science from Carnegie Mellon from the Language Technologies Institute and a Bachelor of Arts in Linguistics and Psychology from the University of California, Los Angeles.

Scaling Rufus, the Amazon generative AI-powered conversational shopping assistant with over 80,000 AWS Inferentia and AWS Trainium chips, for Prime Day

Amazon Rufus is a shopping assistant experience powered by generative AI. It generates answers using relevant information from across Amazon and the web to help Amazon customers make better, more informed shopping decisions. With Rufus, customers can shop alongside a generative AI-powered expert that knows Amazon’s selection inside and out, and can bring it all together with information from across the web to help shoppers make more informed purchase decisions.

To meet the needs of Amazon customers at scale, Rufus required a low-cost, performant, and highly available infrastructure for inference. The solution needed the capability to serve multi-billion parameter large language models (LLMs) with low latency across the world to service its expansive customer base. Low latency makes sure users have a positive experience chatting with Rufus and can start getting responses in less than a second. To achieve this, the Rufus team is using multiple AWS services and AWS AI chips, AWS Trainium and AWS Inferentia.

Inferentia and Trainium are purpose-built chips developed by AWS that accelerate deep learning workloads with high performance and lower overall costs. With these chips, Rufus achieved costs 4.5 times lower than other evaluated solutions while maintaining low latency for its customers. In this post, we dive into the Rufus inference deployment using AWS chips and how this enabled one of the most demanding events of the year, Amazon Prime Day.

Solution overview

At its core, Rufus is powered by an LLM trained on Amazon’s product catalog and information from across the web. LLM deployment can be challenging, requiring you to balance factors such as model size, model accuracy, and inference performance. Larger models generally have better knowledge and reasoning capabilities but come at a higher cost due to more demanding compute requirements and increasing latency. Rufus would need to be deployed and scale to meet the tremendous demand of peak events like Amazon Prime Day. Considerations for this scale include how well it needs to perform, its environmental impact, and the cost of hosting the solution. To meet these challenges, Rufus used a combination of AWS solutions: Inferentia2 and Trainium, Amazon Elastic Container Service (Amazon ECS), and Application Load Balancer (ALB). In addition, the Rufus team partnered with NVIDIA to power the solution using NVIDIA’s Triton Inference Server, providing capabilities to host the model using AWS chips.

Rufus inference is a Retrieval Augmented Generation (RAG) system with responses enhanced by retrieving additional information such as product information from Amazon search results. These results are based on the customer query, making sure the LLM generates reliable, high-quality, and precise responses.

To make sure Rufus was best positioned for Prime Day, the Rufus team built a heterogeneous inference system using multiple AWS Regions powered by Inferentia2 and Trainium. Building a system across multiple Regions allowed Rufus to benefit in two key areas. First, it provided additional capacity that could be used during times of high demand, and second, it improved the overall resiliency of the system.

The Rufus team was also able to use both Inf2 and Trn1 instance types. Because Inf2 and Trn1 instance types use the same AWS Neuron SDK, the Rufus team was able to use both instances to serve the same Rufus model. The only configuration setting to adjust was the tensor parallelism degree (24 for Inf2, 32 for Trn1). Using Trn1 instances also led to an additional 20% latency reduction and throughput improvement compared to Inf2.
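As a minimal sketch of that single configuration difference, the lookup below maps an instance family to the tensor parallelism degree quoted above. The function names, the config keys, and the model name are hypothetical and chosen for illustration; they are not the Rufus team's actual configuration.

```python
# Tensor parallelism degrees quoted in the post; everything else is illustrative.
TP_DEGREE_BY_FAMILY = {"inf2": 24, "trn1": 32}


def tensor_parallel_degree(instance_type):
    """Return the tensor parallelism degree for an instance type such as 'inf2.48xlarge'."""
    family = instance_type.split(".", 1)[0].lower()
    if family not in TP_DEGREE_BY_FAMILY:
        raise ValueError(f"no tensor parallelism degree known for {instance_type!r}")
    return TP_DEGREE_BY_FAMILY[family]


def serving_config(instance_type):
    """Build a serving config in which only the parallelism degree varies by hardware."""
    return {
        "model": "example-llm",  # hypothetical model name
        "tensor_parallel_size": tensor_parallel_degree(instance_type),
    }


inf2_config = serving_config("inf2.48xlarge")
trn1_config = serving_config("trn1.32xlarge")
```

Keeping the degree in one table means the same container image and model artifacts can be reused on both instance families.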

The following diagram illustrates the solution architecture.

To support real-time traffic routing across multiple Regions, Rufus built a novel traffic orchestrator. Amazon CloudWatch supported the underlying monitoring, helping the team adjust the traffic ratio across the different Regions in less than 15 minutes based on the traffic pattern changes. By using this type of orchestration, the Rufus team had the ability to direct requests to other Regions when needed, with a small trade-off of latency to the first token. Due to Rufus’s streaming architecture and the performant AWS network between Regions, the perceived latency was minimal for end-users.

These choices allowed Rufus to scale up to over 80,000 Trainium and Inferentia chips across three Regions, serving an average of 3 million tokens a minute while maintaining a P99 latency of less than 1 second to the first response for Prime Day customers. In addition, by using these purpose-built chips, Rufus achieved 54% better performance per watt than other evaluated solutions, which helped the Rufus team meet energy efficiency goals.

Optimizing inference performance and host utilization

Within each Region, the Rufus inference system used Amazon ECS, which managed the underlying Inferentia and Trainium powered instances. Because Amazon ECS managed the underlying infrastructure, the Rufus team only needed to bring their container and configuration by defining an ECS task. Within each container, an NVIDIA Triton Inference Server with a Python backend runs vLLM with the Neuron SDK. vLLM is a memory-efficient inference and serving engine that is optimized for high throughput. The Neuron SDK makes it straightforward for teams to adopt AWS chips and supports many different libraries and frameworks such as PyTorch Lightning.

The Neuron SDK provides a straightforward LLM inference solution on Trainium and Inferentia hardware with optimized performance supporting a wide range of transformer-based LLM architectures. To reduce latency, Rufus has collaborated with the AWS Annapurna team to develop various optimizations such as INT8 (weight only) quantization, continuous batching with vLLM, and resource, compute, and memory bandwidth optimizations in the Neuron compiler and runtime. These optimizations are currently deployed in Rufus production and are available in Neuron SDK 2.18 and onward.
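To make the idea of weight-only INT8 quantization concrete, here is a generic, pure-Python sketch of symmetric per-tensor quantization. It is a textbook illustration only, not the Neuron SDK's implementation, and the weight values are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of float weights to the range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights; activations stay in higher precision."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.05, 0.9]  # made-up example weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding to the nearest integer bounds the per-weight error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Storing one byte per weight instead of two or four reduces the memory traffic needed to read the model on each decode step, which is why weight-only quantization can help latency.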

To reduce overall waiting time for customers to start seeing a response from Rufus, the team also developed an inference streaming architecture. With the high compute and memory load needed for LLM inference, the total time it takes to finish generating the full response for a customer query can take multiple seconds. With a streaming architecture, Rufus is able to return the tokens right after they’re generated. This optimization allows the customer to start consuming the response in less than 1 second. In addition, multiple services work together using gRPC connections to intelligently aggregate and enhance the streaming response in real time for customers.
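The effect of streaming can be sketched with a plain Python generator. The decode loop below is a simulated stand-in (a fixed sleep per token), not the Rufus serving code; it only shows why the time to the first token is much shorter than the time to the full response when tokens are returned as they are produced.

```python
import time


def generate_tokens(tokens, seconds_per_token):
    """Stand-in for an LLM decode loop: yields each token as soon as it is produced."""
    for token in tokens:
        time.sleep(seconds_per_token)  # simulated per-token compute
        yield token


def measure_streaming(tokens, seconds_per_token):
    """Return (time to first token, total time, tokens received) for a streamed response."""
    start = time.perf_counter()
    first_token_time = None
    received = []
    for token in generate_tokens(tokens, seconds_per_token):
        if first_token_time is None:
            first_token_time = time.perf_counter() - start
        received.append(token)
    total_time = time.perf_counter() - start
    return first_token_time, total_time, received


first, total, received = measure_streaming(["Here", "are", "some", "options"], 0.01)
```

A non-streaming client would wait for `total` before showing anything, whereas a streaming client can start rendering after `first`.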

As shown in the following figure, images and links are embedded in the response, which allow customers to engage and continue exploring with Rufus.

Scaling up

Although we have to maintain low latency for the best customer experience, it’s also crucial to scale the service throughput by achieving high hardware resource utilization. High hardware utilization makes sure accelerators don’t sit idle and needlessly increase costs. To optimize the inference system throughput, the team improved both single-host throughput as well as load balancing efficiency.

Load balancing for LLM inference is tricky due to the following challenges. First, a single host can only handle a limited number of concurrent requests. Second, the end-to-end latency to complete one request can vary, spanning many seconds depending on the LLM response length.

To address the challenges, the team optimized throughput by considering both single-host throughput and throughput across many hosts using load balancing.

The team used the least outstanding requests (LOR) routing algorithm from ALB, increasing throughput fivefold in comparison to an earlier baseline measurement. This allows each host to have enough time to process in-flight requests and stream back responses using a gRPC connection, without getting overwhelmed by multiple requests received at the same time. Rufus also collaborated with AWS and vLLM teams to improve single-host concurrency using vLLM integration with the Neuron SDK and NVIDIA Triton Inference Server.
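ALB implements least outstanding requests routing itself, so no application code is needed to use it. The toy router below only illustrates the selection rule: each new request goes to the host with the fewest in-flight requests. The class and host names are hypothetical.

```python
class LeastOutstandingRequestsRouter:
    """Toy model of LOR routing: pick the host with the fewest in-flight requests."""

    def __init__(self, hosts):
        self.in_flight = {host: 0 for host in hosts}

    def acquire(self):
        """Choose a host for a new request and count the request as in flight."""
        host = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[host] += 1
        return host

    def release(self, host):
        """Mark one request on the host as finished."""
        self.in_flight[host] -= 1


router = LeastOutstandingRequestsRouter(["host-a", "host-b", "host-c"])
first_three = [router.acquire() for _ in range(3)]  # one request lands on each host
router.release("host-b")                            # host-b finishes its request early
next_host = router.acquire()                        # so the next request goes to host-b
```

Because LLM responses vary widely in length, counting outstanding requests tracks real load better than round robin, which would keep sending work to a host that is still busy with long generations.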

Figure 1. ECS tasks scale horizontally hosting the Triton Inference Server and dependencies

With this integration, Rufus was able to benefit from a critical optimization: continuous batching. Continuous batching allows a single host to greatly increase throughput. In addition, continuous batching provides unique capabilities in comparison to other batch techniques, such as static batching. For example, when using static batching, the time to first token (TTFT) increases linearly with the number of requests in one batch. Continuous batching prioritizes the prefill stage for LLM inference, keeping TTFT under control even with more requests running at the same time. This helped Rufus provide a pleasant experience with low latency when generating the first response, and improve the single-host throughput to keep serving costs under control.
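The difference between the two batching styles can be illustrated with a small simulation. This is a simplified model under stated assumptions: each request needs a fixed number of decode steps (at least one), prefill cost is ignored, and a host has a fixed number of batch slots. It is not vLLM's scheduler; it only shows how freeing a slot as soon as a request finishes helps waiting requests.

```python
from collections import deque


def static_batching(lengths, batch_size):
    """Each batch runs until its longest request ends before the next batch starts."""
    clock = 0
    first_token_step = []
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        first_token_step.extend([clock + 1] * len(batch))
        clock += max(batch)
    return clock, first_token_step


def continuous_batching(lengths, batch_size):
    """A finished request frees its slot, and a waiting request is admitted at once."""
    queue = deque(enumerate(lengths))
    running = {}  # request index -> remaining decode steps
    first_token_step = [0] * len(lengths)
    clock = 0
    while queue or running:
        while queue and len(running) < batch_size:
            index, steps = queue.popleft()
            running[index] = steps
            first_token_step[index] = clock + 1
        clock += 1  # one decode step for every running request
        for index in list(running):
            running[index] -= 1
            if running[index] == 0:
                del running[index]
    return clock, first_token_step


lengths = [8, 2, 2, 2]  # made-up decode lengths: one long request, three short ones
static_total, static_first = static_batching(lengths, batch_size=2)
continuous_total, continuous_first = continuous_batching(lengths, batch_size=2)
```

In this toy run, the short requests queued behind the long one receive their first token earlier under continuous batching, and the whole workload finishes in fewer steps because slots do not sit idle.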

Conclusion

In this post, we discussed how Rufus is able to reliably deploy and serve its multi-billion-parameter LLM using the Neuron SDK with Inferentia2 and Trainium chips and AWS services. Rufus continues to evolve with advancements in generative AI and customer feedback and we encourage you to use Inferentia and Trainium.

Learn more about how we are innovating with generative AI across Amazon.


About the authors

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends.

RJ is an Engineer within Amazon. He builds and optimizes distributed systems for training and works on optimizing and adopting systems to reduce latency for ML inference. Outside work, he is exploring using Generative AI for building food recipes.

Yang Zhou is a software engineer working on building and optimizing machine learning systems. His recent focus is enhancing the performance and cost efficiency of generative AI inference. Beyond work, he enjoys traveling and has recently discovered a passion for running long distances.

Adam (Hongshen) Zhao is a Software Development Manager at Amazon Stores Foundational AI. In his current role, Adam leads the Rufus Inference team to build GenAI inference optimization solutions and inference systems at scale for fast inference at low cost. Outside work, he enjoys traveling with his wife and art creations.

Faqin Zhong is a software engineer at Amazon Stores Foundational AI, working on Large Language Model (LLM) inference infrastructure and optimizations. Passionate about Generative AI technology, Faqin collaborates with leading teams to drive innovations, making LLMs more accessible and impactful, ultimately enhancing customer experiences across diverse applications. Outside of work she enjoys cardio exercise and baking with her son.

Nicolas Trown is an engineer in Amazon Stores Foundational AI. His recent focus is lending his systems expertise across Rufus to aid the Rufus Inference team and support efficient utilization across the Rufus experience. Outside of work he enjoys spending time with his wife and day trips to the nearby coast, Napa, and Sonoma areas.

Bing Yin is a director of science at Amazon Stores Foundational AI. He leads the effort to build LLMs that are specialized for shopping use cases and optimized for inference at Amazon scale. Outside of work, he enjoys running marathon races.

Exploring alternatives and seamlessly migrating data from Amazon Lookout for Vision

Amazon Lookout for Vision, the AWS service designed to create customized artificial intelligence and machine learning (AI/ML) computer vision models for automated quality inspection, will be discontinued on October 31, 2025. New customers will not be able to access the service effective October 10, 2024, but existing customers will be able to use the service as normal until October 31, 2025. AWS will continue to support the service with security updates, bug fixes, and availability enhancements, but we do not plan to introduce new features for this service.

This post discusses some alternatives to Lookout for Vision and how you can export your data from Lookout for Vision to migrate to an alternate solution.

Alternatives to Lookout for Vision

If you’re interested in an alternative to Lookout for Vision, AWS has options for both buyers and builders.

For an out-of-the-box solution, the AWS Partner Network offers solutions from multiple partners. You can browse solutions on the Computer Vision for Quality Insights page in the AWS Solutions Library. These partner solutions include options for software, software as a service (SaaS) applications, managed solutions or custom implementations based on your needs. This approach provides a solution that addresses your use case without requiring you to have expertise in imaging, computer vision, AI, or application development. This typically provides the fastest time to value by taking advantage of the specialized expertise of the AWS Partners. The Solutions Library also has additional guidance to help you build solutions faster.

If you prefer to build your own solution, AWS offers AI tools and services to help you develop an AI-based computer vision inspection solution. Amazon SageMaker provides a set of tools to build, train, and deploy ML models for your use case with fully managed infrastructure, tools, and workflows. In addition to SageMaker enabling you to build your own models, Amazon SageMaker JumpStart offers built-in computer vision algorithms and pre-trained defect detection models that can be fine-tuned to your specific use case. This approach provides you the tools to accelerate your AI development while providing complete flexibility to build a solution that meets your exact requirements and integrates with your existing hardware and software infrastructure. This typically provides the lowest operating costs for a solution.

AWS also offers Amazon Bedrock, a fully managed service that offers a choice of high-performing generative AI foundation models (FMs), including models that can help build a defect detection model running in the cloud. This approach enables you to build a custom solution while using the power of generative AI to handle the custom computer vision model creation and some of the code generation to speed development, eliminating the need for full AI computer vision expertise. Amazon Bedrock provides the ability to analyze images for defects, compare performance of different models, and generate code for custom applications. This alternative is useful for use cases that don’t require low latency processing, providing faster time to value and lower development costs.

Migrating data from Lookout for Vision

To move existing data from Lookout for Vision to use in an alternative implementation, the Lookout for Vision SDK provides the capability to export a dataset from the service to an Amazon Simple Storage Service (Amazon S3) bucket. This procedure exports the training dataset, including manifest and dataset images, for a project to a destination Amazon S3 location that you specify. With the exported dataset and manifest file, you can use the same data that you used to create a Lookout for Vision model to create a model using SageMaker or Amazon Bedrock, or provide it to a partner to incorporate into their customizations for your use case.
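As a rough sketch of the first half of such an export, the function below pages through a dataset's manifest entries and collects the image locations. It assumes the `list_dataset_entries` operation of the boto3 `lookoutvision` client and the `source-ref` field used in the manifest lines; the stub client, project name, and bucket name are hypothetical and exist only so the sketch runs without AWS credentials.

```python
import json


def list_dataset_image_uris(client, project_name, dataset_type="train"):
    """Collect the S3 URIs of all images referenced by a dataset's manifest entries."""
    uris = []
    kwargs = {"ProjectName": project_name, "DatasetType": dataset_type}
    while True:
        page = client.list_dataset_entries(**kwargs)
        for line in page.get("DatasetEntries", []):
            entry = json.loads(line)  # each entry is one JSON line of the manifest
            uris.append(entry["source-ref"])
        token = page.get("NextToken")
        if not token:
            return uris
        kwargs["NextToken"] = token


class StubLookoutClient:
    """Stand-in client returning two canned pages (hypothetical data)."""

    def __init__(self):
        self._pages = {
            None: {
                "DatasetEntries": ['{"source-ref": "s3://example-bucket/img1.jpg"}'],
                "NextToken": "page-2",
            },
            "page-2": {
                "DatasetEntries": ['{"source-ref": "s3://example-bucket/img2.jpg"}'],
            },
        }

    def list_dataset_entries(self, **kwargs):
        return self._pages[kwargs.get("NextToken")]


uris = list_dataset_image_uris(StubLookoutClient(), "example-project")
```

With real credentials, you would pass `boto3.client("lookoutvision")` in place of the stub and then copy each listed object, together with the manifest lines, to the destination S3 location.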

Summary

Although Lookout for Vision is planned to shut down on October 31, 2025, AWS offers a powerful set of AI/ML services and solutions in the form of SageMaker tools to build custom models and generative AI with Amazon Bedrock to do customized inspection and generate code, in addition to a range of offerings from partners in the AWS Partner Network. Export tools enable you to effortlessly move your data from Lookout for Vision to an alternate solution if you so choose. You should explore these options to determine what works best for your specific needs.

For more details, refer to the following resources:


About the Author

Tim Westman is the Product Manager and Go-to-Market Lead for Edge Machine Learning, AWS. Tim leads the Product Management and Business Development for the Edge Machine Learning business at Amazon Web Services. In this role, he works with customers to help build computer vision solutions at the edge to solve complex operational challenges. Tim has more than 30 years of experience in sales, business development and product management roles for leading hardware and software companies, with the last 8 years specializing in AI and computer vision for IoT applications.

Unlock the knowledge in your Slack workspace with Slack connector for Amazon Q Business

Amazon Q Business is a fully managed, generative AI-powered assistant that you can configure to answer questions, provide summaries, generate content, and complete tasks based on your enterprise data. Amazon Q Business offers over 40 built-in connectors to popular enterprise applications and document repositories, including Amazon Simple Storage Service (Amazon S3), Salesforce, Google Drive, Microsoft 365, ServiceNow, Gmail, Slack, Atlassian, and Zendesk and can help you create your generative AI solution with minimal configuration.

Nearly 100 thousand organizations use Slack to bring the right people together to securely collaborate with each other. A Slack workspace captures invaluable organizational knowledge in the form of the information that flows through it as the users communicate on it. Hence, it is valuable to make this knowledge quickly and securely available to the users.

In this post, we will demonstrate how to set up Slack connector for Amazon Q Business to sync communications from both public and private channels, reflective of user permissions. We will also guide you through the configurations needed on your Slack workspace. Additionally, you will learn how to configure the Amazon Q Business application and enable user authentication through AWS IAM Identity Center, which is a recommended service for managing a workforce’s access to AWS applications.

Data source overview

Amazon Q Business uses large language models (LLMs) to build a unified solution that connects multiple data sources. Typically, you’d need to use a natural language processing (NLP) technique called Retrieval Augmented Generation (RAG) for this. With RAG, generative AI enhances its responses by incorporating relevant information retrieved from a curated dataset. Amazon Q Business has a built-in managed RAG capability designed to reduce the undifferentiated heavy lifting involved in creating these systems. Typical of a RAG model, Amazon Q Business has two components: A retrieval component that retrieves relevant documents for the user query and a generation component that takes the query and the retrieved documents and then generates an answer to the query using an LLM.

A Slack workspace has multiple elements. It has public channels where workspace users can participate and private channels where only channel members can communicate with each other. Individuals can also directly communicate with each other in one-on-one conversations and in user groups. This communication is in the form of messages and threads of replies, with optional document attachments. Slack workspaces of active organizations are highly dynamic, with the content and collaboration evolving and growing in volume continuously.

The preceding figure shows the process flow of the solution. When you connect Amazon Q Business to a data source (in this case, Slack), what Amazon Q considers and crawls as a document varies by connector. For the Amazon Q Business Slack connector, each message, message attachment, and channel post is considered a single document. However, Slack conversation threads that help you create organized discussions around specific messages are also considered and ingested as a single document, regardless of the number of participants or messages they contain.

Amazon Q Business crawls access control list (ACL) information attached to a document (user and group information) from your Slack instance. This information can be used to filter chat responses to the user’s document access level. The Slack connector supports token-based authentication. This could be a Slack bot user OAuth token or Slack user OAuth token. See the Slack connector overview to get the list of entities that are extracted, supported filters, sync modes, and file types.

User IDs (_user_id) exist in Slack on messages and channels where access permissions are set. They are mapped from user email addresses, which serve as the IDs in Slack.

To connect your data source connector to Amazon Q Business, you must give Amazon Q Business an IAM role that has the following permissions:

  • Permission to access the BatchPutDocument and BatchDeleteDocument operations to ingest documents.
  • Permission to access the User Store API operations to ingest user and group access control information from documents.
  • Permission to access your AWS Secrets Manager secret to authenticate your data source connector instance.
  • (Optional) If you’re using Amazon Virtual Private Cloud (Amazon VPC), permission to access your Amazon VPC.
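The permissions listed above could be expressed as an IAM policy document along the following lines. This is a hedged sketch: the `qbusiness:BatchPutDocument`, `qbusiness:BatchDeleteDocument`, and `secretsmanager:GetSecretValue` actions follow directly from the list, while the User Store action names, the statement IDs, the resource scoping, and all ARNs are assumptions to verify against the connector documentation. The optional Amazon VPC permissions are omitted.

```python
import json


def build_connector_policy(application_arn, index_arn, secret_arn):
    """Assemble an illustrative IAM policy document for a data source role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "IngestDocuments",
                "Effect": "Allow",
                "Action": [
                    "qbusiness:BatchPutDocument",
                    "qbusiness:BatchDeleteDocument",
                ],
                "Resource": [index_arn],
            },
            {
                "Sid": "IngestUserStoreMappings",
                "Effect": "Allow",
                # Assumed User Store actions; confirm against the documentation.
                "Action": [
                    "qbusiness:PutGroup",
                    "qbusiness:CreateUser",
                    "qbusiness:DeleteGroup",
                    "qbusiness:UpdateUser",
                    "qbusiness:ListGroups",
                ],
                "Resource": [application_arn, index_arn],
            },
            {
                "Sid": "ReadConnectorSecret",
                "Effect": "Allow",
                "Action": ["secretsmanager:GetSecretValue"],
                "Resource": [secret_arn],
            },
        ],
    }


# Placeholder ARNs for illustration only.
policy = build_connector_policy(
    application_arn="arn:aws:qbusiness:us-east-1:111122223333:application/EXAMPLE-APP",
    index_arn="arn:aws:qbusiness:us-east-1:111122223333:application/EXAMPLE-APP/index/EXAMPLE-INDEX",
    secret_arn="arn:aws:secretsmanager:us-east-1:111122223333:secret:EXAMPLE-SECRET",
)
policy_json = json.dumps(policy, indent=2)
```

If you choose Create a new service role in the console, Amazon Q Business generates the role for you, so a hand-written policy like this is mainly relevant when you manage roles yourself.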

Solution overview

In this solution, we will show you how to create a Slack workspace with users who perform various roles within the organization. We will then show you how to configure this workspace to define a set of scopes that are required by the Amazon Q Business Slack connector to index the user communication. This will be followed by the configuration of the Amazon Q Business application and a Slack data source. Based on the configuration, when the data source is synchronized, the connector crawls and indexes the content from the workspace that was created on or before a specific date. The connector also collects and ingests ACL information for each indexed message and document. Thus, the search results of a query made by a user include results only from those documents that the user is authorized to read.

Prerequisites

To build the Amazon Q Business connector for Slack, you need the following:

In Slack:

  • Create a Slack bot user OAuth token or Slack user OAuth token. You can choose either token to connect Amazon Q Business to your Slack data source. See the Slack documentation on access tokens for more information.
  • Note your Slack workspace team ID from your Slack workspace main page URL. For example, https://app.slack.com/client/T0123456789/... where T0123456789 is the team ID.
  • Add the OAuth scopes and read permissions.
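If you want to pull the team ID out of the workspace URL programmatically, a small helper like the following works for URLs of the shape shown above. The function name is hypothetical, and the pattern assumes the ID is the path segment after /client/ and starts with T.

```python
import re
from urllib.parse import urlparse


def extract_team_id(workspace_url):
    """Return the Slack team ID from a workspace URL, or None if it is not present."""
    path = urlparse(workspace_url).path
    match = re.match(r"^/client/(T[A-Z0-9]+)(?:/|$)", path)
    return match.group(1) if match else None


team_id = extract_team_id("https://app.slack.com/client/T0123456789/C0123456789")
missing = extract_team_id("https://app.slack.com/")
```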

In your AWS account:

  • Create an AWS Identity and Access Management (IAM) role for your data source and, if using the Amazon Q Business API, note the ARN of the IAM role.
  • Store your Slack authentication credentials in an AWS Secrets Manager secret and, if using the Amazon Q Business API, note the ARN of the secret.
  • Enable and configure an IAM Identity Center instance. Amazon Q Business integrates with IAM Identity Center as a gateway to manage user access to your Amazon Q Business application. We recommend enabling and pre-configuring an Identity Center instance before you begin to create your Amazon Q Business application. Identity Center is the recommended AWS service for managing human user access to AWS resources. Amazon Q Business supports both organization and account level Identity Center instances. See Setting up for Amazon Q Business for more information.

Configure your Slack workspace

You will create one user for each of the following roles: Administrator, Data scientist, Database administrator, Solutions architect and Generic.

User name      Role
arnav_desai    Admin
jane_doe       Data Scientist
pat_candella   DB Admin
mary_major     Solutions Architect
john_stiles    Generic User

To showcase the ACL propagation, you will create three public channels, #general, #customerwork, and #random, that any member can access, including the Generic user. You will also create one private channel, #anydepartment-project-private, that can be accessed only by the users arnav_desai, john_stiles, mary_major, and pat_candella.

To create a Slack app:

  1. Navigate to the Slack API Your Apps page and choose Create New App.
  2. Select From scratch. In the next screen, select the workspace to develop your app, and then choose Create an App.
  3. Give the Slack app a name and select a workspace to develop your app in. Then choose Create App.
  4. After you’ve created your app, select it and navigate to Features and choose OAuth & Permissions.
  5. Scroll down to Scopes > User Token Scopes and set the OAuth scope based on the user token scopes in Prerequisites for connecting Amazon Q Business to Slack.

Note: You can configure two types of scopes in a Slack workspace:

  1. Bot token scope: Only the messages to which it has been explicitly added are crawled by the bot token. It is employed to grant restricted access to specific messages only.
  2. User token scope: Only the data shared with the member is accessible to the user token, which acts as a representative of a Slack user.

For this example, so you can search on the conversations between users, you will use the user token scope.

  6. After the OAuth scope for the user token has been set up as described in the Slack prerequisites, scroll up to the section OAuth Tokens for your Workspace, choose Install to Workspace, and then choose Allow.
  7. This will generate a user OAuth token. Copy this token to use when configuring the Amazon Q Business Slack connector.

Configure the data source using the Amazon Q Business Slack connector

In this section, you will create an Amazon Q Business application using the console.

To create an Amazon Q Business application

  1. In the AWS Management Console for Amazon Q Business, choose Create Application.
  2. Enter an Application Name, such as my-slack-workspace. Leave the Service access as the default value, and select AWS IAM Identity Center for Access Management. Enter a new Tag value as required and choose Create to create the Amazon Q Business application.
  3. Leave the default option of Use Native retriever selected for Retrievers, leave Enterprise as the Index provisioning and leave the default value of 1 as the Number of units. Each unit in Amazon Q Business index is 20,000 documents or 200 MB of extracted text (whichever comes first). Choose Next.
  4. Scroll down the list of available connectors and select Slack and then choose Next.

    1. Enter a Data source name and a Description to identify your data source and then enter the Slack workspace team ID to connect with Amazon Q Business.
    2. In the Authentication section, select Create and add a new secret.
    3. On the dialog box that appears, enter a Secret name followed by the User OAuth Slack token that was copied from the Slack workspace.
    4. For the IAM role, select Create a new service role (Recommended).
    5. In Sync scope, choose the following:
      • For select type of content to crawl, select All channels.
      • Select an appropriate date for Select crawl start date.
      • Leave the default value selected for Maximum file size as 50.
      • You can include specific Messages, such as bot messages or archived messages to sync.
      • Additionally, you can include up to 100 patterns to include or exclude filenames, types, or file paths to sync.

    6. For Sync mode, leave Full sync selected and for the Sync run schedule, select Run on demand.
    7. Leave the field mapping as is and choose Add data source.
    8. On the next page, choose Next.
  5. Add the five users you created earlier, who are a part of IAM Identity Center and the Slack workspace to the Amazon Q Business application. To add users to Identity Center, follow the instructions in Add users to your Identity Center directory. When done, choose Add groups and users and choose Assign.
  6. When a user is added, each user is assigned the default Q Business Pro subscription. For more information on different pricing tiers, see the Amazon Q Business pricing page.
  7. Choose Create application to finish creating the Amazon Q Business application.
  8. After the application and the data source are created, select the data source and then choose Sync now to start syncing documents from your data source.
  9. The sync process ingests the documents from your Slack workspace based on your selections in the Slack connector configuration in Amazon Q Business. The following screenshot shows the results of a successful sync, indicated by the status of Completed.
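The index capacity rule from step 3 (each unit holds 20,000 documents or 200 MB of extracted text, whichever comes first) can be turned into a quick sizing estimate. The helper below is a back-of-the-envelope sketch with a hypothetical function name; actual provisioning and pricing should be checked against the Amazon Q Business documentation.

```python
import math

DOCS_PER_UNIT = 20_000  # documents per index unit, as stated in step 3
MB_PER_UNIT = 200       # MB of extracted text per index unit, as stated in step 3


def index_units_needed(document_count, extracted_text_mb):
    """Estimate index units: whichever limit is reached first drives the count."""
    by_documents = math.ceil(document_count / DOCS_PER_UNIT)
    by_text = math.ceil(extracted_text_mb / MB_PER_UNIT)
    return max(1, by_documents, by_text)


# Hypothetical workspaces: one limited by document count, one by text volume.
units_many_small_docs = index_units_needed(document_count=45_000, extracted_text_mb=150)
units_few_large_docs = index_units_needed(document_count=5_000, extracted_text_mb=450)
```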

Search with Amazon Q Business

Now, you’re ready to make a few queries in Amazon Q Business.

To search using Amazon Q Business:

  1. Navigate to the Web experience settings tab and click on the Deployed URL.
  2. For this demonstration, sign in as pat_candella who has the role of DB Admin.
  3. Enter the password for pat_candella and choose Sign in.
  4. Upon successful sign-in, you will be signed in to Amazon Q Business.
  5. In the Slack workspace, there is a public channel, the #customerwork channel that all users are members of. The #customerwork Slack channel is being used to communicate about an upcoming customer engagement, as shown in the following figure.
  6. Post the first question to Amazon Q Business.
I am currently using Apache Kafka. Can you list high level steps involved in migration to Amazon MSK?

Note that the response includes citations that refer to the conversation as well as the content of the PDF that was attached to the conversation.

Security and privacy options with Slack data connector

Next, you will create a private channel called #anydepartment-project-private with four out of the five users—arnav_desai, john_stiles, mary_major and pat_candella—and verify that the messages exchanged in a private channel are not available to non-members like jane_doe. Note that after you create a new private channel, you need to manually re-run the sync on the data source.

The following screenshot shows the private Slack channel with four out of five users and the Slack conversation.

Testing security and privacy options with Slack data connector

  1. While signed in as pat_candella, who is part of the private #anydepartment-project-private channel, execute the following query:
    What is Amazon Kendra and which API do I use to query a Kendra index?

  2. Now, sign in as jane_doe, who is not a member of the #anydepartment-project-private channel and execute the same query.
  3. Amazon Q Business prevents jane_doe from getting insights from information within the private channels that they aren’t part of, based on the synced ACL information.
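Conceptually, the behavior in the steps above is a membership check applied to retrieved documents before a response is generated. The sketch below models it with the channels and users from this walkthrough; the document IDs and message text are made up, and the function is an illustration of the idea rather than how Amazon Q Business implements ACL filtering internally.

```python
# None means a public channel that every workspace member can read.
CHANNEL_MEMBERS = {
    "#general": None,
    "#customerwork": None,
    "#random": None,
    "#anydepartment-project-private": {
        "arnav_desai", "john_stiles", "mary_major", "pat_candella",
    },
}

# Hypothetical indexed documents with illustrative text.
DOCUMENTS = [
    {"id": "msg-1", "channel": "#customerwork", "text": "MSK migration notes"},
    {"id": "msg-2", "channel": "#anydepartment-project-private", "text": "Kendra query notes"},
]


def visible_documents(user, documents, channel_members):
    """Keep only the documents whose channel ACL allows the given user to read them."""
    visible = []
    for document in documents:
        members = channel_members[document["channel"]]
        if members is None or user in members:
            visible.append(document)
    return visible


pat_ids = [d["id"] for d in visible_documents("pat_candella", DOCUMENTS, CHANNEL_MEMBERS)]
jane_ids = [d["id"] for d in visible_documents("jane_doe", DOCUMENTS, CHANNEL_MEMBERS)]
```

Here pat_candella can draw on both documents, while jane_doe only sees the public channel message, which mirrors the result of the two queries.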

Indexing aggregated Slack threads

Slack organizes conversations into threads, which can involve multiple users and messages. The Amazon Q Business Slack connector treats each thread as a single document, regardless of the number of participants or messages it contains. This approach allows Amazon Q Business to ingest entire conversation threads as individual units, maximizing the amount of data that can be processed within a single index unit. As a result, you can efficiently incorporate more comprehensive conversational context into your Amazon Q Business system.

The figure that follows shows a conversation between pat_candella and jane_doe that includes six messages in a thread. The Slack connector aggregates this message thread as a single document, thus maximizing the use of an index unit.

Because the conversation thread is aggregated as a single document within the Amazon Q Business index, you can ask questions that pertain to a single conversation thread as shown in the following figure.
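To make the aggregation idea concrete, here is a minimal sketch of grouping messages into one document per thread. It assumes Slack-style message dictionaries in which replies carry the parent's timestamp in a `thread_ts` field. The `aggregate_threads` function and the sample messages are hypothetical and don't represent the connector's actual implementation.

```python
from collections import defaultdict


def aggregate_threads(messages):
    """Group Slack-style messages into one text document per thread."""
    threads = defaultdict(list)
    for msg in messages:
        # Replies carry the parent's timestamp in "thread_ts"; a message
        # without it is treated as its own single-message thread.
        thread_id = msg.get("thread_ts", msg["ts"])
        threads[thread_id].append(msg)

    documents = {}
    for thread_id, msgs in threads.items():
        # Keep the messages in chronological order inside the document.
        msgs.sort(key=lambda m: float(m["ts"]))
        documents[thread_id] = "\n".join(f'{m["user"]}: {m["text"]}' for m in msgs)
    return documents


messages = [
    {"ts": "100.0", "thread_ts": "100.0", "user": "pat_candella",
     "text": "Which index should we use?"},
    {"ts": "101.0", "thread_ts": "100.0", "user": "jane_doe",
     "text": "The shared one."},
    {"ts": "200.0", "user": "jane_doe", "text": "Unrelated note."},
]

docs = aggregate_threads(messages)
print(len(docs))
print(docs["100.0"])
```

The two threaded messages end up in a single document, so a question about that conversation can be answered from one indexed unit.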

Troubleshooting the sync process

  • Why isn’t Amazon Q Business answering any of my questions?

If you aren’t getting answers to your questions from Amazon Q Business, verify the following:

  • Permissions – Document ACLs indexed by Amazon Q Business may not allow you to query certain data entities as demonstrated in our example. If this is the case, please reach out to your Slack workspace administrator to make sure that your user has access to required documents and repeat the sync process.
  • Data connector sync – A failed data source sync may prevent the documents from being indexed, meaning that Amazon Q Business would be unable to answer questions about the documents that failed to sync. Please refer to the official documentation to troubleshoot data source connectors.
  • I’m receiving access errors on Amazon Q Business application. What causes this?

See Troubleshooting Amazon Q Business identity and access to diagnose and fix common issues that you might encounter when working with Amazon Q and IAM.

  • How can I sync documents without ACLs?

Amazon Q Business supports crawling ACLs for document security by default. Turning off ACL and identity crawling is no longer supported. If you want to index documents without ACLs, make sure that the documents are marked as public in your data source. For more information, refer to the official documentation, How Amazon Q Business connector crawls Slack ACLs.

  • My connector is unable to sync. How can I monitor data source sync progress?

Amazon Q Business provides visibility into the data sync operations. Learn more about this feature in the AWS Machine Learning blog.

Additionally, as the sync process runs, you can monitor progress or debug failures by monitoring the Amazon CloudWatch logs that can be accessed from the Details section of the Sync run history.

The following sample query lists the documents or messages that were indexed from a specific Slack channel, C12AB34578, for the log stream SYNC_RUN_HISTORY_REPORT/xxxxxxxxxxxxxxxxxxxxxxxx:

fields LogLevel, DocumentId, DocumentTitle, CrawlAction, ConnectorDocumentStatus.Status as ConnectorDocumentStatus, ErrorMsg, CrawlStatus.Status as CrawlStatus, SyncStatus.Status as SyncStatus, IndexStatus.Status as IndexStatus, SourceUri, Acl, Metadata, HashedDocumentId, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/xxxxxxxxxxxxxxxxxxxxxxxx' and Metadata like /"stringValue":"C12AB34578"/
| sort @timestamp desc
| limit 10000

Choosing Run query displays the list of messages as the Amazon Q Business Index sync runs, as shown in the following figure.
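If you prefer to run a similar query programmatically, the following is a minimal sketch that uses the CloudWatch Logs `start_query` and `get_query_results` operations through a boto3 client that you pass in. The helper names `build_sync_report_query` and `run_query`, the shortened field list, and the one-day lookback are assumptions for illustration. The log group name depends on your application and isn't shown in this post.

```python
import time


def build_sync_report_query(log_stream, channel_id, limit=10000):
    """Build a Logs Insights query for one sync run and one Slack channel."""
    return (
        "fields LogLevel, DocumentId, DocumentTitle, CrawlAction, ErrorMsg, "
        "SourceUri, @timestamp\n"
        f"| filter @logStream like '{log_stream}' "
        f'and Metadata like /"stringValue":"{channel_id}"/\n'
        "| sort @timestamp desc\n"
        f"| limit {limit}"
    )


def run_query(logs_client, log_group, query, lookback_seconds=86400):
    """Start the query and poll until it is no longer scheduled or running."""
    now = int(time.time())
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=now - lookback_seconds,
        endTime=now,
        queryString=query,
    )["queryId"]
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            return response["results"]
        time.sleep(1)
```

For example, you could create the client with `boto3.client("logs")` and call `run_query(client, "<YOUR-LOG-GROUP>", build_sync_report_query("SYNC_RUN_HISTORY_REPORT/xxxxxxxxxxxxxxxxxxxxxxxx", "C12AB34578"))`.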

Cleanup

To delete an Amazon Q Business application, you can use the console or the DeleteApplication API operation.

To delete an Amazon Q Business application using the console

  1. Sign in to the Amazon Q Business console.
  2. Select the respective Amazon Q Business application.
  3. Choose Delete.
  4. In the dialog box that opens, enter Delete to confirm deletion, and then choose Delete.
  5. You are returned to the service console while your application is deleted. When the deletion process is complete, the console displays a message confirming successful deletion.
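As an alternative to the console steps, the following minimal sketch looks up an application by its display name and calls the DeleteApplication operation through a boto3 Amazon Q Business client that you pass in. The helper names `find_application_id` and `delete_application_by_name` are hypothetical. The sketch assumes that display names are unique in your account.

```python
def find_application_id(qbusiness_client, display_name):
    """Page through ListApplications and return the ID of the matching app."""
    kwargs = {}
    while True:
        page = qbusiness_client.list_applications(**kwargs)
        for app in page.get("applications", []):
            if app.get("displayName") == display_name:
                return app["applicationId"]
        token = page.get("nextToken")
        if not token:
            return None
        kwargs = {"nextToken": token}


def delete_application_by_name(qbusiness_client, display_name):
    """Delete the application with this display name; return False if absent."""
    application_id = find_application_id(qbusiness_client, display_name)
    if application_id is None:
        return False
    qbusiness_client.delete_application(applicationId=application_id)
    return True
```

You could create the client with `boto3.client("qbusiness")`. Deletion is irreversible, so it's worth printing the resolved application ID before you call the delete function.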

To delete the IAM Identity Center instance, see Delete your IAM Identity Center instance.

Conclusion

This blog post provides a step-by-step guide on setting up the Slack connector for Amazon Q Business, enabling you to seamlessly integrate data from your Slack workspace. Moreover, we highlighted the importance of data privacy and security, demonstrating how the connector adheres to the ACLs within your Slack workspace. This feature helps ensure that private channel conversations remain confidential and inaccessible to individuals who aren’t members of those channels. By following these steps and understanding the built-in security measures, you can use the power of Amazon Q Business while maintaining the integrity and privacy of your Slack workspace.

To learn more about the Amazon Q Business connector for Slack, see Connecting Slack to Amazon Q Business. You can automate all the showcased console operations through Amazon Q Business APIs, the AWS CLI, and other applicable AWS SDKs.

If you choose to converse with Amazon Q Business using Slack direct messages (DMs) to ask questions and get answers based on company data or to get help creating new content such as email drafts, summarize attached files, and perform tasks, see Deploy a Slack gateway for Amazon Q, your business expert for information about how to bring Amazon Q, your business expert, to users in Slack.


About the Authors

Akshara Shah is a Senior Solutions Architect at Amazon Web Services. She provides strategic technical guidance to help customers design and build cloud solutions. She is currently focused on machine learning and AI technologies.

Roshan Thomas is a Senior Solutions Architect at Amazon Web Services. He is based in Melbourne, Australia and works closely with enterprise customers to accelerate their journey in the cloud. He is passionate about technology and helping customers architect and build solutions on AWS.


Transitioning off Amazon Lookout for Metrics 


Amazon Lookout for Metrics is a fully managed service that uses machine learning (ML) to detect anomalies in virtually any time-series business or operational metrics—such as revenue performance, purchase transactions, and customer acquisition and retention rates—with no ML experience required. The service, which was launched in March 2021, predates several popular AWS offerings that have anomaly detection, such as Amazon OpenSearch, Amazon CloudWatch, AWS Glue Data Quality, Amazon Redshift ML, and Amazon QuickSight.

After careful consideration, we have made the decision to end support for Amazon Lookout for Metrics, effective October 10, 2025. In addition, as of today, new customer sign-ups are no longer available. Existing customers will be able to use the service as usual until October 10, 2025, when we will end support for Amazon Lookout for Metrics.

In this post, we provide an overview of the alternate AWS services that offer anomaly detection capabilities for customers to consider transitioning their workloads to.

AWS services with anomaly detection capabilities

We recommend customers use Amazon OpenSearch, Amazon CloudWatch, Amazon Redshift ML, Amazon QuickSight, or AWS Glue Data Quality services for their anomaly detection use cases as an alternative to Amazon Lookout for Metrics. These AWS services offer generally available, ML-powered anomaly detection capabilities that can be used out of the box without requiring any ML expertise. Following is a brief overview of each service.

Using Amazon OpenSearch for anomaly detection

Amazon OpenSearch Service features a highly performant, integrated anomaly detection engine that enables the real-time identification of anomalies in streaming data as well as in historical data. You can pair anomaly detection with built-in alerting in OpenSearch to send notifications when there is an anomaly. To start using OpenSearch for anomaly detection, you must first index your data into OpenSearch. From there, you can enable anomaly detection in OpenSearch Dashboards. To learn more, see the documentation.

Using Amazon CloudWatch for anomaly detection

Amazon CloudWatch supports creating anomaly detectors on specific Amazon CloudWatch Log Groups by applying statistical and ML algorithms to CloudWatch metrics. Anomaly detection alarms can be created based on a metric’s expected value. These types of alarms don’t have a static threshold for determining alarm state. Instead, they compare the metric’s value to the expected value based on the anomaly detection model. To start using CloudWatch anomaly detection, you first must ingest data into CloudWatch and then enable anomaly detection on the log group.
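The idea of comparing a metric to an expected band instead of a static threshold can be illustrated with a small sketch. This is not the model that CloudWatch uses. It is a simple stand-in that derives the band from the mean and standard deviation of recent values, and the function names and sample data are hypothetical.

```python
from statistics import mean, stdev


def expected_band(history, band_width=2.0):
    """Return (low, high) bounds around the typical value seen in history."""
    center = mean(history)
    spread = stdev(history)
    return center - band_width * spread, center + band_width * spread


def is_anomalous(value, history, band_width=2.0):
    """Flag a value that falls outside the expected band."""
    low, high = expected_band(history, band_width)
    return value < low or value > high


history = [100, 102, 98, 101, 99, 100, 103, 97]
print(expected_band(history))
print(is_anomalous(101, history))
print(is_anomalous(140, history))
```

With this history, the band spans 96 to 104, so a value of 101 is treated as normal and a value of 140 is flagged. A larger `band_width` makes the detector less sensitive.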

Using Amazon Redshift ML for anomaly detection

Amazon Redshift ML makes it easy to create, train, and apply machine learning models using familiar SQL commands in Amazon Redshift data warehouses. Anomaly detection can be done on your analytics data through Redshift ML by using the included XGBoost model type, local models, or remote models with Amazon SageMaker. With Redshift ML, you don’t have to be a machine learning expert and you pay only for the training cost of the SageMaker models. There are no additional costs to using Redshift ML for anomaly detection. To learn more, see the documentation.

Using Amazon QuickSight for anomaly detection

Amazon QuickSight is a fast, cloud-powered, business intelligence service that delivers insights to everyone in the organization. As a fully managed service, QuickSight lets customers create and publish interactive dashboards that include ML insights. QuickSight supports a highly performant, integrated anomaly detection engine that uses proven Amazon technology to continuously run ML-powered anomaly detection across millions of metrics to discover hidden trends and outliers in customers’ data. This tool allows customers to get deep insights that are often buried in the aggregates and not scalable with manual analysis. With ML-powered anomaly detection, customers can find outliers in their data without the need for manual analysis, custom development, or ML domain expertise. To learn more, see the documentation.

Using AWS Glue Data Quality for anomaly detection

Data engineers and analysts can use AWS Glue Data Quality to measure and monitor their data. AWS Glue Data Quality uses a rule-based approach that works well for known data patterns and offers ML-based recommendations to help you get started. You can review the recommendations and augment rules from over 25 included data quality rules. To capture unanticipated, less obvious data patterns, you can enable anomaly detection. To use this feature, you can write rules or analyzers and then turn on anomaly detection in AWS Glue ETL. AWS Glue Data Quality collects statistics for columns specified in rules and analyzers, applies ML algorithms to detect anomalies, and generates visual observations explaining the detected issues. Customers can use recommended rules to capture the anomalous patterns and provide feedback to tune the ML model for more accurate detection. To learn more, see the blog post, watch the introductory video, or see the documentation.

Using Amazon SageMaker Canvas for anomaly detection (a beta feature)

The Amazon SageMaker Canvas team plans to provide support for anomaly detection use cases in Amazon SageMaker Canvas. We’ve created an AWS CloudFormation template-based solution to give customers early access to the underlying anomaly detection feature. Customers can use the CloudFormation template to bring up an application stack that receives time-series data from an Amazon Managed Streaming for Apache Kafka (Amazon MSK) streaming source and performs near-real-time anomaly detection in the streaming data. To learn more about the beta offering, see Anomaly detection in streaming time series data with online learning using Amazon Managed Service for Apache Flink.

Frequently asked questions

  1. What is the cutoff point for current customers?

We created an allow list of account IDs that have used Amazon Lookout for Metrics in the last 30 days and have active Amazon Lookout for Metrics resources, including detectors, within the service. If you are an existing customer and are having difficulties using the service, please reach out to us via AWS Customer Support for help.

  2. How will access change before the sunset date?

Current customers can do all the things they could previously. The only change is that non-current customers cannot create any new resources in Amazon Lookout for Metrics.

  3. What happens to my Amazon Lookout for Metrics resources after the sunset date?

After October 10, 2025, all references to Amazon Lookout for Metrics models and resources will be deleted from Amazon Lookout for Metrics. You will not be able to discover or access Amazon Lookout for Metrics from your AWS Management Console, and applications that call the Amazon Lookout for Metrics API will no longer work.

  4. Will I be billed for Amazon Lookout for Metrics resources remaining in my account after October 10, 2025?

Resources created by Amazon Lookout for Metrics internally will be deleted after October 10, 2025. Customers will be responsible for deleting the input data sources created by them, such as Amazon Simple Storage Service (Amazon S3) buckets, Amazon Redshift clusters, and so on.

  5. How do I delete my Amazon Lookout for Metrics resources?
  6. How can I export anomalies data before deleting the resources?

Anomalies data for each measure can be downloaded for a detector by using the Amazon Lookout for Metrics APIs for a particular detector. Exporting Anomalies explains how to connect to a detector, query for anomalies, and download them into a format for later use.
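As a rough illustration of such an export, the following sketch pages through the ListAnomalyGroupSummaries operation for one detector and writes the results to a JSON file. It takes a boto3 Amazon Lookout for Metrics client as an argument. The helper names, the sensitivity threshold of 50, and the output path are assumptions for illustration, and the Exporting Anomalies guide remains the authoritative procedure.

```python
import json


def collect_anomaly_groups(lookout_client, detector_arn, sensitivity=50):
    """Page through the anomaly group summaries for one detector."""
    groups, kwargs = [], {}
    while True:
        page = lookout_client.list_anomaly_group_summaries(
            AnomalyDetectorArn=detector_arn,
            SensitivityThreshold=sensitivity,
            **kwargs,
        )
        groups.extend(page.get("AnomalyGroupSummaryList", []))
        token = page.get("NextToken")
        if not token:
            return groups
        kwargs = {"NextToken": token}


def export_to_json(groups, path):
    """Write the collected summaries to a local JSON file for later use."""
    with open(path, "w", encoding="utf-8") as handle:
        # default=str keeps non-JSON types such as datetimes serializable.
        json.dump(groups, handle, indent=2, default=str)
```

You could create the client with `boto3.client("lookoutmetrics")` and then call `export_to_json(collect_anomaly_groups(client, "<YOUR-DETECTOR-ARN>"), "anomalies.json")`.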

Conclusion

In this blog post, we have outlined methods to create anomaly detectors using alternates such as Amazon OpenSearch, Amazon CloudWatch, and a CloudFormation template-based solution.



About the Author

Nirmal Kumar is Sr. Product Manager for the Amazon SageMaker service. Committed to broadening access to AI/ML, he steers the development of no-code and low-code ML solutions. Outside work, he enjoys travelling and reading non-fiction.


Efficient Pre-training of Llama 3-like model architectures using torchtitan on Amazon SageMaker


This post is co-written with Less Wright and Wei Feng from Meta.

Pre-training large language models (LLMs) is the first step in developing powerful AI systems that can understand and generate human-like text. By exposing models to vast amounts of diverse data, pre-training lays the groundwork for LLMs to learn general language patterns, world knowledge, and reasoning capabilities. This foundational process enables LLMs to perform a wide range of tasks without task-specific training, making them highly versatile and adaptable. Pre-training is essential for building a strong base of knowledge, which can then be refined and specialized through fine-tuning, transfer learning, or few-shot learning approaches.

In this post, we collaborate with the team working on PyTorch at Meta to showcase how the torchtitan library accelerates and simplifies the pre-training of Meta Llama 3-like model architectures. We showcase the key features and capabilities of torchtitan, such as FSDP2, torch.compile integration, and FP8 support, that optimize training efficiency. We pre-train a Meta Llama 3 8B model architecture using torchtitan on Amazon SageMaker on p5.48xlarge instances, each equipped with 8 NVIDIA H100 GPUs. We demonstrate a 38.23% speedup in training throughput compared to the baseline without the optimizations (as shown in the following figure). Amazon SageMaker Model Training reduces the time and cost to train and tune machine learning (ML) models at scale without the need to manage infrastructure. You can take advantage of the highest-performing ML compute infrastructure currently available, and SageMaker can automatically scale infrastructure up or down, from one to thousands of GPUs.

To learn more, you can find our complete code sample on GitHub.

Introduction to torchtitan

torchtitan is a reference architecture for large-scale LLM training using native PyTorch. It aims to showcase PyTorch’s latest distributed training features in a clean, minimal code base. The library is designed to be simple to understand, use, and extend for different training purposes, with minimal changes required to the model code when applying various parallel processing techniques.

torchtitan offers several key features, including FSDP2 with per-parameter sharding, tensor parallel processing, selective layer and operator activation checkpointing, and distributed checkpointing. It supports pre-training of Meta Llama 3-like and Llama 2-like model architectures of various sizes and includes configurations for multiple datasets. The library provides straightforward configuration through TOML files and offers performance monitoring through TensorBoard. In the following sections, we highlight some of the key features of torchtitan.

Transitioning from FSDP1 to FSDP2

FSDP1 and FSDP2 are two approaches to fully sharded data parallel training. FSDP1 uses flat-parameter sharding, which flattens all parameters to 1D, concatenates them into a single tensor, pads it, and then chunks it across workers. This method offers bounded padding and efficient unsharded storage, but might not always allow optimal sharding for individual parameters. FSDP2, on the other hand, represents sharded parameters as DTensors sharded on dim-0, handling each parameter individually. This approach enables easier manipulation of parameters, for example per-weight learning rate, communication-free sharded state dicts, and simpler meta-device initialization. The transition from FSDP1 to FSDP2 reflects a shift towards more flexible and efficient parameter handling in distributed training, addressing limitations of the flat-parameter approach while potentially introducing new optimization opportunities.
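The difference between the two layouts can be sketched with shapes alone, without any PyTorch calls. The following is a simplified illustration, not the FSDP implementation. The function names and the toy parameter shapes are hypothetical, and the per-parameter split assumes chunk-style sharding in which trailing workers may receive smaller or empty slices.

```python
import math


def fsdp1_flat_shards(param_shapes, world_size):
    """FSDP1-style: flatten all parameters, pad, and split into equal 1D chunks."""
    total = sum(math.prod(shape) for shape in param_shapes.values())
    chunk = math.ceil(total / world_size)
    padding = chunk * world_size - total
    return {"elements_per_rank": chunk, "padding": padding}


def fsdp2_dim0_shards(param_shapes, world_size):
    """FSDP2-style: shard each parameter on dim 0 so ranks hold named slices."""
    per_rank = [dict() for _ in range(world_size)]
    for name, shape in param_shapes.items():
        rows = math.ceil(shape[0] / world_size)
        for rank in range(world_size):
            start = min(rank * rows, shape[0])
            stop = min(start + rows, shape[0])
            per_rank[rank][name] = (stop - start, *shape[1:])
    return per_rank


shapes = {"embed.weight": (10, 4), "norm.weight": (6,)}
flat = fsdp1_flat_shards(shapes, world_size=4)
per_rank = fsdp2_dim0_shards(shapes, world_size=4)
print(flat)
print(per_rank[0])
print(per_rank[3])
```

In the flat layout, each rank holds an anonymous 1D chunk that can straddle parameter boundaries. In the per-parameter layout, each rank holds a slice of every named parameter, which is what makes operations such as per-weight learning rates or sharded state dicts more direct.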

torchtitan support for torch.compile

torch.compile is a key feature in PyTorch that significantly boosts model performance with minimal code changes. Through its just-in-time (JIT) compilation, it analyzes and transforms PyTorch code into more efficient kernels. torchtitan supports torch.compile, which delivers substantial speedups, especially for large models and complex architectures, by using techniques like operator fusion, memory planning, and automatic kernel selection. This is enabled by setting compile = true in the model’s TOML configuration file.

torchtitan support for FP8 linear operations

torchtitan provides support for FP8 (8-bit floating point) computation that significantly reduces memory footprint and enhances performance in LLM training. FP8 has two formats, E4M3 and E5M2, each optimized for different aspects of training. E4M3 offers higher precision, making it ideal for forward propagation, whereas E5M2, with its larger dynamic range, is better suited for backpropagation. Although FP8 operates at a lower precision, it had no noticeable impact on model accuracy in our runs, which we demonstrate by convergence comparisons of the Meta Llama 3 8B pre-training at 2,000 steps. FP8 support on torchtitan is through the torchao library, and we enable FP8 by setting enable_float8_linear = true in the model’s TOML configuration file.
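The precision versus range trade-off between the two formats can be worked out from their bit layouts. The following sketch assumes the commonly used conventions for these formats, which aren't spelled out in this post: E4M3 with an exponent bias of 7 in the variant that reserves only one bit pattern per sign for NaN, and E5M2 with a bias of 15 and an IEEE-style reserved top exponent. The function name is hypothetical.

```python
def fp8_max_and_step(exponent_bits, mantissa_bits, bias, top_exponent_is_special):
    """Return (largest finite value, spacing of representable values near 1.0)."""
    max_exponent_field = 2 ** exponent_bits - 1
    if top_exponent_is_special:
        # IEEE-style: the all-ones exponent is reserved for inf and NaN.
        max_exponent_field -= 1
        max_mantissa = 2 ** mantissa_bits - 1
    else:
        # Assumed E4M3 variant: only the all-ones mantissa at the top exponent is NaN.
        max_mantissa = 2 ** mantissa_bits - 2
    largest = 2.0 ** (max_exponent_field - bias) * (1 + max_mantissa / 2 ** mantissa_bits)
    step = 2.0 ** -mantissa_bits
    return largest, step


e4m3 = fp8_max_and_step(4, 3, bias=7, top_exponent_is_special=False)
e5m2 = fp8_max_and_step(5, 2, bias=15, top_exponent_is_special=True)
print("E4M3 (max, step near 1.0):", e4m3)
print("E5M2 (max, step near 1.0):", e5m2)
```

Under these assumptions, E4M3 tops out at 448 with a step of 0.125 near 1.0, whereas E5M2 reaches 57344 with a coarser step of 0.25. This is the sense in which E4M3 trades range for precision and E5M2 does the opposite.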

torchtitan support for FP8 all-gather

This feature enables efficient communication of FP8 tensors across multiple GPUs, significantly reducing network bandwidth compared to bfloat16 all-gather operations. FP8 all-gather performs float8 casting before the all-gather operation, reducing the message size. Key to its efficiency is the combined absolute maximum (AMAX) AllReduce, which calculates AMAX for all float8 parameters in a single operation after the optimizer step, avoiding multiple small all-reduces. Similar to FP8 support, this also has no impact on model accuracy, which we demonstrate by convergence comparisons of the Meta Llama 3 8B pre-training.
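The role of AMAX in float8 casting can be sketched in plain Python. This single-process illustration is not the torchao or torchtitan implementation. It only shows how one pass over all parameters can produce the per-parameter absolute maxima and the scales derived from them. The function names, the toy parameter values, and the choice of 448 (the assumed largest finite E4M3 value) are assumptions, and the cross-GPU all-reduce itself is not shown.

```python
E4M3_MAX = 448.0  # assumed largest finite value of the E4M3 format


def combined_amax(params):
    """Compute the absolute maximum of every parameter in one pass."""
    return {name: max(abs(v) for v in values) for name, values in params.items()}


def dynamic_scales(amax_by_param, fp8_max=E4M3_MAX, eps=1e-12):
    """Pick a scale so each parameter's largest magnitude maps to fp8_max."""
    return {name: fp8_max / max(amax, eps) for name, amax in amax_by_param.items()}


params = {"wq": [0.5, -2.0, 1.0], "wk": [0.25, -0.125]}
amax = combined_amax(params)
scales = dynamic_scales(amax)
print(amax)
print(scales)
```

In a distributed run, the per-parameter maxima would be stacked and reduced across workers in one collective instead of one small collective per parameter, which is the saving that the combined AMAX step describes.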

Pre-training Meta Llama 3 8B with torchtitan on Amazon SageMaker

SageMaker training jobs offer several key advantages that enhance the pre-training process of Meta Llama 3-like model architectures with torchtitan. It provides a fully managed environment that simplifies large-scale distributed training across multiple instances, which is crucial for efficiently pre-training LLMs. SageMaker supports custom containers, which allows seamless integration of the torchtitan library and its dependencies, so all necessary components are readily available.

The built-in distributed training capabilities of SageMaker streamline the setup of multi-GPU and multi-node jobs, reducing the complexity typically associated with such configurations. Additionally, SageMaker integrates with TensorBoard, enabling real-time monitoring and visualization of training metrics and providing valuable insights into the pre-training process. With these features, researchers and practitioners can focus more on model development and optimization rather than infrastructure management, ultimately accelerating the iterative process of creating and refining custom LLMs.

Solution overview

In the following sections, we walk you through how to prepare a custom image with the torchtitan library, then configure a training job estimator function to launch a Meta Llama 3 8B model pre-training with the c4 dataset (Colossal Clean Crawled Corpus) on SageMaker. The c4 dataset is a large-scale web text corpus that has been cleaned and filtered to remove low-quality content. It is frequently used for pre-training language models.

Prerequisites

Before you begin, make sure you have the following requirements in place:

Build the torchtitan custom image

SageMaker BYOC (Bring Your Own Container) allows you to use custom Docker containers to train and deploy ML models. Typically, SageMaker provides built-in algorithms and preconfigured environments for popular ML frameworks. However, there may be cases where you have unique or proprietary algorithms, dependencies, or specific requirements that aren’t available in the built-in options, necessitating custom containers. In this case, we need to use the nightly versions of torch, torchdata, and the torchao package to train with FP8 precision.

We use the Amazon SageMaker Studio Image Build convenience package, which offers a command line interface (CLI) to simplify the process of building custom container images directly from SageMaker Studio notebooks. This tool eliminates the need for manual setup of Docker build environments, streamlining the workflow for data scientists and developers. The CLI automatically manages the underlying AWS services required for image building, such as Amazon Simple Storage Service (Amazon S3), AWS CodeBuild, and Amazon Elastic Container Registry (Amazon ECR), allowing you to focus on your ML tasks rather than infrastructure setup. It offers a simple command interface, handles packaging of Dockerfiles and container code, and provides the resulting image URI for use in SageMaker training and hosting.

Before getting started, make sure your AWS Identity and Access Management (IAM) execution role has the required IAM permissions and policies to use the Image Build CLI. For more information, see Using the Amazon SageMaker Studio Image Build CLI to build container images from your Studio notebooks. We have provided the Jupyter notebook to build the custom container in the GitHub repo.

Complete the following steps to build the custom image:

  1. Install the Image Build package with the following command:
! pip install sagemaker-studio-image-build
  2. Extend the pre-built PyTorch training image so that you can use its included deep learning libraries and settings without creating an image from scratch:
FROM 763104351884.dkr.ecr.${REGION}.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker
  3. Next, specify the libraries to install. You need the nightly versions of torch, torchdata, and the torchao libraries:
RUN pip install --pre torch --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu121

RUN pip install --pre torchdata --index-url https://download.pytorch.org/whl/nightly

# Install torchtitan dependencies (version specifiers are quoted so the shell doesn't treat ">" as a redirect)
RUN pip install --no-cache-dir \
    "datasets>=2.19.0" \
    "tomli>=1.1.0" \
    tensorboard \
    sentencepiece \
    tiktoken \
    blobfile \
    tabulate

# Install the torchao package for FP8 support
RUN pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu121
# Display installed packages for reference
RUN pip freeze
  4. Use the Image Build CLI to build and push the image to Amazon ECR:

!sm-docker build --repository torchtitan:latest .

You’re now ready to use this image for pre-training models with torchtitan in SageMaker.

Prepare your dataset (optional)

By default, the torchtitan library uses the allenai/c4 “en” dataset in its training configuration. This dataset is streamed directly during training using the HuggingFaceDataset class. However, you may want to pre-train the Meta Llama 3 models on your own dataset residing in Amazon S3. For this purpose, we have prepared a sample Jupyter notebook that downloads the allenai/c4 “en” dataset from the Hugging Face dataset hub to an S3 bucket by running a SageMaker processing job. We use the SageMaker InputDataConfiguration to load the dataset to our training instances in a later section.

Launch your training with torchtitan

Complete the following steps to launch your training:

  1. Import the necessary SageMaker modules and retrieve your work environment details, such as AWS account ID and AWS Region. Make sure to upgrade the SageMaker SDK to the latest version. This might require a SageMaker Studio kernel restart.
%pip install --upgrade "sagemaker>=2.224"
%pip install sagemaker-experiments

import os
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch

role = get_execution_role()
print(f"SageMaker Execution Role: {role}")

client = boto3.client("sts")
account = client.get_caller_identity()["Account"]
print(f"AWS account: {account}")

session = boto3.session.Session()
region = session.region_name
print(f"AWS region: {region}")

sm_boto_client = boto3.client("sagemaker")
sagemaker_session = sagemaker.session.Session(boto_session=session)

default_bucket = sagemaker_session.default_bucket()
print("Default bucket for this session: ", default_bucket)
  2. Clone the torchtitan repository and prepare the training environment. Create a source directory and move the necessary dependencies from the torchtitan directory. This step makes sure you have all the required files for your training process.
!git clone https://github.com/pytorch/torchtitan.git
!mkdir torchtitan/src
!mv torchtitan/torchtitan/ torchtitan/train_configs/ torchtitan/train.py torchtitan/src/
  3. Use the following command to download the Meta Llama 3 tokenizer, which is essential for preprocessing your dataset. Provide your Hugging Face token.
    python torchtitan/src/torchtitan/datasets/download_tokenizer.py --repo_id meta-llama/Meta-Llama-3-8B --tokenizer_path "original" --hf_token="YOUR_HF_TOKEN"

One of the key advantages of torchtitan is its straightforward configuration through TOML files. We modify the Meta Llama 3 8B TOML configuration file to enable monitoring and optimization features.

  4. Enable TensorBoard profiling for better insights into the training process:
[metrics]
log_freq = 10
enable_tensorboard = true
save_tb_folder = "/opt/ml/output/tensorboard"
  5. Enable torch.compile for improved performance:
compile = true
  6. Enable FP8 for more efficient computations:
[float8]
enable_float8_linear = true
  7. Activate FP8 all-gather for optimized distributed training:
enable_fsdp_float8_all_gather = true
precompute_float8_dynamic_scale_for_fsdp = true
  8. To monitor the training progress, set up TensorBoard output. This allows you to visualize the training metrics in real time, providing valuable insights into how the model is learning.
from sagemaker.debugger import TensorBoardOutputConfig

# Must match save_tb_folder in the TOML configuration
LOG_DIR = "/opt/ml/output/tensorboard"
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=f"s3://sagemaker-{region}-{account}/tensorboard/",
    container_local_output_path=LOG_DIR,
)
  9. Set up the data channels for SageMaker training. Create TrainingInput objects that point to the preprocessed dataset in Amazon S3, so your model has access to the training data it needs.
# Update the path below with the S3 dataset path from running the dataset preparation notebook
training_dataset_location = "<PATH-TO-DATASET>"

s3_train_bucket = training_dataset_location

# Default to no input channels so that estimator.fit() still works without an S3 dataset
data_channels = None
if s3_train_bucket is not None:
   train = sagemaker.inputs.TrainingInput(s3_train_bucket, distribution="FullyReplicated", s3_data_type="S3Prefix")
   data_channels = {"train": train}

  10. With all the pieces in place, you’re ready to create the SageMaker PyTorch estimator. This estimator encapsulates all the configurations, including the custom container, hyperparameters, and resource allocations.

import os
from time import gmtime, strftime

# The training script reads its settings from the modified TOML file
hyperparameters = {
   "config_file": "train_configs/llama3_8b.toml"
}
timestamp = strftime("%Y-%m-%d-%H-%M", gmtime())

estimator = PyTorch(
   base_job_name=f"llama3-8b-{timestamp}",
   entry_point="train.py",
   image_uri="<PATH-TO-IMAGE-URI>",  # URI of the custom torchtitan image built earlier
   source_dir=os.path.join(os.getcwd(), "src"),
   role=role,
   instance_type="ml.p5.48xlarge",
   volume_size=800,
   instance_count=4,
   hyperparameters=hyperparameters,
   use_spot_instances=False,
   sagemaker_session=sagemaker_session,
   tensorboard_output_config=tensorboard_output_config,
   # Launch the job with torch distributed across all instances
   distribution={
      "torch_distributed": {"enabled": True},
   },
)
  11. Initiate the model training on SageMaker:

estimator.fit(inputs=data_channels)

Performance numbers

The following table summarizes the performance numbers for the various training runs with different optimizations.

Setup: Llama 3 8B pre-training on 4 x p5.48xlarge instances (32 NVIDIA H100 GPUs)

| Configuration | TOML Configuration | Throughput (Tokens per Second) | Speedup Over Baseline |
| --- | --- | --- | --- |
| Baseline | Default configuration | 6475 | |
| torch.compile | compile = true | 7166 | 10.67% |
| FP8 linear | compile = true, enable_float8_linear = true | 8624 | 33.19% |
| FP8 all-gather | compile = true, enable_float8_linear = true, enable_fsdp_float8_all_gather = true, precompute_float8_dynamic_scale_for_fsdp = true | 8950 | 38.23% |

The performance results show clear optimization progress in Meta Llama 3 8B pre-training. torch.compile delivered a 10.67% speedup over the baseline, and adding FP8 linear operations roughly tripled this to 33.19%. Adding FP8 all-gather further increased the speedup to 38.23% over the baseline. This progression demonstrates how combining optimization strategies significantly enhances training efficiency.
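The speedup figures follow directly from the throughput numbers. The following short sketch recomputes them from the reported tokens per second. The dictionary layout and the function name are ours, and the results are rounded to one decimal place because the reported throughputs are themselves rounded.

```python
throughput = {
    "Baseline": 6475,
    "torch.compile": 7166,
    "FP8 linear": 8624,
    "FP8 all-gather": 8950,
}


def speedup_percent(tokens_per_second, baseline):
    """Relative throughput gain over the baseline, in percent."""
    return (tokens_per_second / baseline - 1) * 100


baseline = throughput["Baseline"]
speedups = {
    name: round(speedup_percent(tps, baseline), 1)
    for name, tps in throughput.items()
}
print(speedups)
```

The recomputed gains of about 10.7%, 33.2%, and 38.2% agree with the reported figures to within rounding.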

The following figure illustrates the stepwise performance gains for Meta Llama 3 8B pre-training on torchtitan with the optimizations.

These optimizations didn’t affect the model’s training quality. The loss curves for all optimization levels, including the baseline, torch.compile(), FP8 linear, and FP8 all-gather configurations, remained consistent throughout the training process, as shown in the following figure.

Loss curves with different configurations

The following table showcases the consistent loss value with the different configurations.

| Configuration | Loss After 2,000 Steps |
| --- | --- |
| Baseline | 3.602 |
| Plus torch.compile | 3.601 |
| Plus FP8 | 3.612 |
| Plus FP8 all-gather | 3.607 |

Clean up

After you complete your training experiments, clean up your resources to avoid unnecessary charges. You can start by deleting any unused SageMaker Studio resources. Next, remove the custom container image from Amazon ECR by deleting the repository you created. If you ran the optional step to use your own dataset, delete the S3 bucket where this data was stored.

Conclusion

In this post, we demonstrated how to efficiently pre-train Meta Llama 3 models using the torchtitan library on SageMaker. With torchtitan’s advanced optimizations, including torch.compile, FP8 linear operations, and FP8 all-gather, we achieved a 38.23% acceleration in Meta Llama 3 8B pre-training without compromising the model’s accuracy.

SageMaker simplified the large-scale training by offering seamless integration with custom containers, effortless scaling across multiple instances, built-in support for distributed training, and integration with TensorBoard for real-time monitoring.

Pre-training is a crucial step in developing powerful and adaptable LLMs that can effectively tackle a wide range of tasks and applications. By combining the latest PyTorch distributed training features in torchtitan with the scalability and flexibility of SageMaker, organizations can use their proprietary data and domain expertise to create robust and high-performance AI models. Get started by visiting the GitHub repository for the complete code example and optimize your LLM pre-training workflow.

Special thanks

Special thanks to Gokul Nadathur (Engineering Manager at Meta), Gal Oshri (Principal Product Manager Technical at AWS) and Janosch Woschitz (Sr. ML Solution Architect at AWS) for their support in launching this post.


About the Authors

Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS. He helps AWS customers, from small startups to large enterprises, train and deploy foundation models efficiently on AWS. He is passionate about computational optimization problems and improving the performance of AI workloads.

Kanwaljit Khurmi is a Principal Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Trevor Harvey is a Principal Specialist in Generative AI at Amazon Web Services (AWS) and an AWS Certified Solutions Architect – Professional. He serves as a voting member of the PyTorch Foundation Governing Board, where he contributes to the strategic advancement of open-source deep learning frameworks. At AWS, Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.

Less Wright is an AI/Partner Engineer in PyTorch. He works on Triton/CUDA kernels (Accelerating Dequant with SplitK work decomposition); paged, streaming, and quantized optimizers; and PyTorch Distributed (PyTorch FSDP).

Wei Feng is a Software Engineer on the PyTorch distributed team. He has worked on float8 all-gather for FSDP2, TP (Tensor Parallel) in TorchTitan, and 4-bit quantization for distributed QLoRA in TorchTune. He is also a core maintainer of FSDP2.

Time series forecasting with Amazon SageMaker AutoML

Time series forecasting is a critical component in various industries for making informed decisions by predicting future values of time-dependent data. A time series is a sequence of data points recorded at regular time intervals, such as daily sales revenue, hourly temperature readings, or weekly stock market prices. These forecasts are pivotal for anticipating trends and future demands in areas such as product demand, financial markets, energy consumption, and many more.

However, creating accurate and reliable forecasts poses significant challenges because of factors such as seasonality, underlying trends, and external influences that can dramatically impact the data. Additionally, traditional forecasting models often require extensive domain knowledge and manual tuning, which can be time-consuming and complex.

In this blog post, we explore a comprehensive approach to time series forecasting using the Amazon SageMaker AutoMLV2 Software Development Kit (SDK). SageMaker AutoMLV2 is part of the SageMaker Autopilot suite, which automates the end-to-end machine learning workflow from data preparation to model deployment. Throughout this blog post, we will be talking about AutoML to indicate SageMaker Autopilot APIs, as well as Amazon SageMaker Canvas AutoML capabilities. We’ll walk through the data preparation process, explain the configuration of the time series forecasting model, detail the inference process, and highlight key aspects of the project. This methodology offers insights into effective strategies for forecasting future data points in a time series, using the power of machine learning without requiring deep expertise in model development. The code for this post can be found in the GitHub repo.

The following diagram depicts the basic AutoMLV2 APIs, all of which are relevant to this post. The diagram shows the workflow for building and deploying models using the AutoMLV2 API. In the training phase, CSV data is uploaded to Amazon S3, followed by the creation of an AutoML job, model creation, and checking for job completion. The deployment phase allows you to choose between real-time inference via an endpoint or batch inference using a scheduled transform job that stores results in S3.

Basic AutoMLV2 APIs

1. Data preparation

The foundation of any machine learning project is data preparation. For this project, we used a synthetic dataset containing time series data of product sales across various locations, focusing on attributes such as product code, location code, timestamp, unit sales, and promotional information. The dataset can be found in an Amazon-owned, public Amazon Simple Storage Service (Amazon S3) dataset.

When preparing your CSV file for input into a SageMaker AutoML time series forecasting model, you must ensure that it includes at least three essential columns (as described in the SageMaker AutoML V2 documentation):

  1. Item identifier attribute name: This column contains unique identifiers for each item or entity for which predictions are desired. Each identifier distinguishes the individual data series within the dataset. For example, if you’re forecasting sales for multiple products, each product would have a unique identifier.
  2. Target attribute name: This column represents the numerical values that you want to forecast. These could be sales figures, stock prices, energy usage amounts, and so on. It’s crucial that the data in this column is numeric because the forecasting models predict quantitative outcomes.
  3. Timestamp attribute name: This column indicates the specific times when the observations were recorded. The timestamp is essential for analyzing the data in a chronological context, which is fundamental to time series forecasting. The timestamps should be in a consistent and appropriate format that reflects the regularity of your data (for example, daily or hourly).

All other columns in the dataset are optional and can be used to include additional time-series related information or metadata about each item. Therefore, your CSV file should have columns named according to the preceding attributes (item identifier, target, and timestamp) as well as any other columns needed to support your use case. For instance, if your dataset is about forecasting product demand, your CSV might look something like this:

  • Product_ID (item identifier): Unique product identifiers.
  • Sales (target): Historical sales data to be forecasted.
  • Date (timestamp): The dates on which sales data was recorded.

The process of splitting the training and test data in this project uses a methodical and time-aware approach to ensure that the integrity of the time series data is maintained. Here’s a detailed overview of the process:

Ensuring timestamp integrity

The first step involves converting the timestamp column of the input dataset to a datetime format using pd.to_datetime. This conversion is crucial for sorting the data chronologically in subsequent steps and for ensuring that operations on the timestamp column are consistent and accurate.

Sorting the data

The sorted dataset is critical for time series forecasting, because it ensures that data is processed in the correct temporal order. The input_data DataFrame is sorted based on three columns: product_code, location_code, and timestamp. This multi-level sort guarantees that the data is organized first by product and location, and then chronologically within each product-location grouping. This organization is essential for the logical partitioning of data into training and test sets based on time.
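The two steps above can be sketched with pandas as follows. This is a minimal illustration rather than the code from the repository: the column names follow the post, but the rows are made up for the example.

```python
import pandas as pd

# Tiny illustrative dataset; column names follow the post, values are made up.
input_data = pd.DataFrame(
    {
        "product_code": ["B", "A", "A", "B"],
        "location_code": ["L1", "L1", "L1", "L1"],
        "timestamp": ["2024-02-08", "2024-02-08", "2024-02-01", "2024-02-01"],
        "unit_sales": [7, 5, 3, 9],
    }
)

# Convert the timestamp strings to datetime so they sort chronologically.
input_data["timestamp"] = pd.to_datetime(input_data["timestamp"])

# Sort by product, then location, then time within each product-location group.
input_data = input_data.sort_values(
    ["product_code", "location_code", "timestamp"]
).reset_index(drop=True)

print(input_data)
```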

Splitting into training and test sets

The splitting mechanism is designed to handle each combination of product_code and location_code separately, respecting the unique temporal patterns of each product-location pair. For each group:

  • The initial test set is determined by selecting the last eight timestamps (yellow + green below). This subset represents the most recent data points that are candidates for testing the model’s forecasting ability.
  • The final test set is refined by removing the last four timestamps from the initial test set, resulting in a test dataset that includes the four timestamps immediately preceding the latest data (green below). This strategy ensures the test set is representative of the near-future periods the model is expected to predict, while also leaving out the most recent data to simulate a realistic forecasting scenario.
  • The training set comprises the remaining data points, excluding the last eight timestamps (blue below). This ensures the model is trained on historical data that precedes the test period, avoiding any data leakage and ensuring that the model learns from genuinely past observations.

This process is visualized in the following figure with an arbitrary value on the Y axis and the days of February on the X axis.

Time series data split
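The split described above might look like the following sketch. It assumes one row per timestamp in each product-location group, so that the last eight rows correspond to the last eight timestamps. The helper name split_group and the sample data are hypothetical and are not taken from the repository.

```python
import pandas as pd


def split_group(group, holdout=8, test_size=4):
    """Split one chronologically sorted series into train and test rows.

    Assumes one row per timestamp, so the last `holdout` rows are the last
    `holdout` timestamps. The final `holdout - test_size` rows are left out
    of both sets.
    """
    start = max(len(group) - holdout, 0)
    train = group.iloc[:start]
    test = group.iloc[start:start + test_size]
    return train, test


# Hypothetical data: 12 weekly observations for two product-location pairs.
dates = pd.date_range("2024-01-07", periods=12, freq="7D")
input_data = pd.DataFrame(
    [
        {"product_code": product, "location_code": loc, "timestamp": ts, "unit_sales": i}
        for product, loc in [("A", "L1"), ("B", "L2")]
        for i, ts in enumerate(dates)
    ]
)

train_dfs, test_dfs = [], []
for _, group in input_data.groupby(["product_code", "location_code"]):
    group = group.sort_values("timestamp")
    train, test = split_group(group)
    train_dfs.append(train)
    test_dfs.append(test)
```

Keeping the split inside a small function makes it straightforward to change the holdout length if your forecast horizon differs from four periods.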

The test dataset is used to evaluate the performance of the trained model and compute various loss metrics, such as mean absolute error (MAE) and root-mean-squared error (RMSE). These metrics quantify the model’s accuracy in forecasting the actual values in the test set, providing a clear indication of the model’s quality and its ability to make accurate predictions. The evaluation process is detailed in the “Inference: Batch, real-time, and asynchronous” section, where we discuss the comprehensive approach to model evaluation and conditional model registration based on the computed metrics.
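As a reference for the two metrics named above, the following sketch computes MAE and RMSE with NumPy. The function names and sample values are illustrative only, and the evaluation script in the repository may compute them differently.

```python
import numpy as np


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference; penalizes large errors more."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Made-up actuals and forecasts for four test periods.
actual = [10, 12, 9, 11]
predicted = [11, 10, 9, 14]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(f"MAE={mae:.3f} RMSE={rmse:.3f}")
```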

Creating and saving the datasets

After the data for each product-location group is categorized into training and test sets, the subsets are aggregated into comprehensive training and test DataFrames using pd.concat. This aggregation step combines the individual DataFrames stored in train_dfs and test_dfs lists into two unified DataFrames:

  • train_df for training data
  • test_df for testing data

Finally, the DataFrames are saved to CSV files (train.csv for training data and test.csv for test data), making them accessible for model training and evaluation processes. This saving step not only facilitates a clear separation of data for modeling purposes but also enables reproducibility and sharing of the prepared datasets.
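A minimal sketch of the aggregation and saving step is shown below. The per-group DataFrames are made up, and the files are written to a temporary directory (an assumption for the example) instead of a project folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical per-group splits, standing in for the train_dfs and test_dfs lists.
train_dfs = [
    pd.DataFrame({"product_code": ["A", "A"], "timestamp": ["2024-02-01", "2024-02-08"], "unit_sales": [3, 5]}),
    pd.DataFrame({"product_code": ["B", "B"], "timestamp": ["2024-02-01", "2024-02-08"], "unit_sales": [9, 7]}),
]
test_dfs = [
    pd.DataFrame({"product_code": ["A"], "timestamp": ["2024-02-15"], "unit_sales": [4]}),
    pd.DataFrame({"product_code": ["B"], "timestamp": ["2024-02-15"], "unit_sales": [8]}),
]

# Combine the per-group pieces into one DataFrame each, with a fresh index.
train_df = pd.concat(train_dfs, ignore_index=True)
test_df = pd.concat(test_dfs, ignore_index=True)

# Write both files without the index column so the CSV holds only data columns.
output_dir = tempfile.mkdtemp()
train_path = os.path.join(output_dir, "train.csv")
test_path = os.path.join(output_dir, "test.csv")
train_df.to_csv(train_path, index=False)
test_df.to_csv(test_path, index=False)

# Read one file back to confirm that the round trip preserves the rows.
reloaded = pd.read_csv(train_path)
```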

Summary

This data preparation strategy meticulously respects the chronological nature of time series data and ensures that the training and test sets are appropriately aligned with real-world forecasting scenarios. By splitting the data based on the last known timestamps and carefully excluding the most recent periods from the training set, the approach mimics the challenge of predicting future values based on past observations, thereby setting the stage for a robust evaluation of the forecasting model’s performance.

2. Training a model with AutoMLV2

SageMaker AutoMLV2 reduces the resources needed to train, tune, and deploy machine learning models by automating the heavy lifting involved in model development. It provides a straightforward way to create high-quality models tailored to your specific problem type, be it classification, regression, or forecasting, among others. In this section, we delve into the steps to train a time series forecasting model with AutoMLV2.

Step 1: Define the time series forecasting configuration

The first step involves defining the problem configuration. This configuration guides AutoMLV2 in understanding the nature of your problem and the type of solution it should seek, whether it involves classification, regression, time-series forecasting, computer vision, natural language processing, or fine-tuning of large language models. This versatility is crucial because it allows AutoMLV2 to adapt its approach based on the specific requirements and complexities of the task at hand. For time series forecasting, the configuration includes details such as the frequency of forecasts, the horizon over which predictions are needed, and any specific quantiles or probabilistic forecasts. Configuring the AutoMLV2 job for time series forecasting involves specifying parameters that would best use the historical sales data to predict future sales.

The AutoMLTimeSeriesForecastingConfig is a configuration object in the SageMaker AutoMLV2 SDK designed specifically for setting up time series forecasting tasks. Each argument provided to this configuration object tailors the AutoML job to the specifics of your time series data and the forecasting objectives.

from sagemaker.automl.automlv2 import AutoMLTimeSeriesForecastingConfig

time_series_config = AutoMLTimeSeriesForecastingConfig(
    forecast_frequency='W',
    forecast_horizon=4,
    item_identifier_attribute_name='product_code',
    target_attribute_name='unit_sales',
    timestamp_attribute_name='timestamp',
    ...
)

The following is a detailed explanation of each configuration argument used in your time series configuration:

  • forecast_frequency
    • Description: Specifies how often predictions should be made.
    • Value ‘W’: Indicates that forecasts are expected on a weekly basis. The model will be trained to understand and predict data as a sequence of weekly observations. Valid intervals are an integer followed by Y (year), M (month), W (week), D (day), H (hour), and min (minute). For example, 1D indicates every day and 15min indicates every 15 minutes. The value of a frequency must not overlap with the next larger frequency. For example, you must use a frequency of 1H instead of 60min.
  • forecast_horizon
    • Description: Defines the number of future time-steps the model should predict.
    • Value 4: The model will forecast four time-steps into the future. Given the weekly frequency, this means the model will predict the next four weeks of data from the last known data point.
  • forecast_quantiles
    • Description: Specifies the quantiles at which to generate probabilistic forecasts.
    • Values [p50,p60,p70,p80,p90]: These quantiles represent the 50th, 60th, 70th, 80th, and 90th percentiles of the forecast distribution, providing a range of possible outcomes and capturing forecast uncertainty. For instance, the p50 quantile (median) might be used as a central forecast, while the p90 quantile provides a higher-end forecast, where 90% of the actual data is expected to fall below the forecast, accounting for potential variability.
  • filling
    • Description: Defines how missing data should be handled before training, specifying filling strategies for different scenarios and columns.
    • Value filling_config: This should be a dictionary detailing how to fill missing values in your dataset, such as filling missing promotional data with zeros or specific columns with predefined values. This ensures the model has a complete dataset to learn from, improving its ability to make accurate forecasts.
  • item_identifier_attribute_name
    • Description: Specifies the column that uniquely identifies each time series in the dataset.
    • Value ‘product_code’: This setting indicates that each unique product code represents a distinct time series. The model will treat data for each product code as a separate forecasting problem.
  • target_attribute_name
    • Description: The name of the column in your dataset that contains the values you want to predict.
    • Value unit_sales: Designates the unit_sales column as the target variable for forecasts, meaning the model will be trained to predict future sales figures.
  • timestamp_attribute_name
    • Description: The name of the column indicating the time point for each observation.
    • Value ‘timestamp’: Specifies that the timestamp column contains the temporal information necessary for modeling the time series.
  • grouping_attribute_names
    • Description: A list of column names that, in combination with the item identifier, can be used to create composite keys for forecasting.
    • Value [‘location_code’]: This setting means that forecasts will be generated for each combination of product_code and location_code. It allows the model to account for location-specific trends and patterns in sales data.
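The forecast_frequency rule above (an optional integer followed by a unit, without overlapping the next larger unit) can be checked before submitting a job. The following is a hypothetical helper, not part of the SageMaker SDK. The upper bounds in MAX_MULTIPLE are assumptions derived from the overlap rule, so verify them against the AutoML V2 documentation before relying on them.

```python
import re

# Assumed upper bound per unit, chosen so a value never reaches the next
# larger unit (for example, 60min should be written as 1H).
MAX_MULTIPLE = {"min": 59, "H": 23, "D": 6, "W": 4, "M": 11, "Y": 1}

# Optional integer followed by one of the unit codes listed in the post.
_FREQUENCY_PATTERN = re.compile(r"^(\d+)?(Y|M|W|D|H|min)$")


def is_valid_forecast_frequency(value: str) -> bool:
    """Return True if `value` looks like a non-overlapping forecast frequency."""
    match = _FREQUENCY_PATTERN.match(value)
    if match is None:
        return False
    # A bare unit such as 'W' is treated as a multiple of 1.
    multiple = int(match.group(1)) if match.group(1) else 1
    return 1 <= multiple <= MAX_MULTIPLE[match.group(2)]


for candidate in ["W", "1D", "15min", "60min", "weekly"]:
    print(candidate, is_valid_forecast_frequency(candidate))
```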

The configuration provided instructs SageMaker AutoML to train a model capable of weekly sales forecasts for each product and location, accounting for uncertainty with quantile forecasts, handling missing data, and recognizing each product-location pair as a unique series. This detailed setup aims to optimize the forecasting model’s relevance and accuracy for your specific business context and data characteristics.
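To make the meaning of the forecast_quantiles values concrete, the following sketch computes p50 through p90 from a set of simulated outcomes for a single future week. The simulated distribution is an assumption for illustration only. AutoML derives its quantile forecasts from the trained models, not from code like this.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical simulated outcomes for one future week of unit sales.
samples = rng.normal(loc=100.0, scale=15.0, size=10_000)

# Map the quantile names used in the configuration to probability levels.
quantile_levels = {"p50": 0.5, "p60": 0.6, "p70": 0.7, "p80": 0.8, "p90": 0.9}
forecast = {name: float(np.quantile(samples, level)) for name, level in quantile_levels.items()}

# Share of simulated outcomes at or below the p90 forecast (close to 0.9 by construction).
share_below_p90 = float(np.mean(samples <= forecast["p90"]))

for name, value in forecast.items():
    print(f"{name}: {value:.1f}")
print(f"share of outcomes at or below p90: {share_below_p90:.3f}")
```

A planner who wants to avoid stockouts might order against p90, and one who wants a central estimate might use p50.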

Step 2: Initialize the AutoMLV2 job

Next, initialize the AutoMLV2 job by specifying the problem configuration, the AWS role with permissions, the SageMaker session, a base job name for identification, and the output path where the model artifacts will be stored.

from sagemaker.automl.automlv2 import AutoMLV2

automl_sm_job = AutoMLV2(
    problem_config=time_series_config,
    role=role,
    sagemaker_session=sagemaker_session,
    base_job_name='time-series-forecasting-job',
    output_path=f's3://{bucket}/{prefix}/output'
)

Step 3: Fit the model

To start the training process, call the fit method on your AutoMLV2 job object. This method requires specifying the input data’s location in Amazon S3 and whether SageMaker should wait for the job to complete before proceeding further. During this step, AutoMLV2 will automatically pre-process your data, select algorithms, train multiple models, and tune them to find the best solution.

from sagemaker.automl.automlv2 import AutoMLDataChannel

automl_sm_job.fit(
    inputs=[AutoMLDataChannel(s3_data_type='S3Prefix', s3_uri=train_uri, channel_type='training')],
    wait=True,
    logs=True
)

Model fitting can take several hours, depending on the size of your dataset and compute budget. A larger compute budget allows for more powerful instance types, which can accelerate the training process. If you’re not running this code as part of the provided SageMaker notebook (which handles the order of code cell processing correctly), you will need to implement custom code that monitors the training status before retrieving and deploying the best model.
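One way to monitor the training status outside the notebook is a simple polling loop, sketched below. The helper wait_for_automl_job is hypothetical. It takes any callable that returns a dictionary with an AutoMLJobStatus key, which matches the response shape of the boto3 SageMaker client's describe_auto_ml_job_v2 call. The example uses a fake callable so that it runs without AWS credentials.

```python
import time
from typing import Callable, Dict

# Statuses after which the job will not change state again.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_automl_job(
    describe_job: Callable[[], Dict],
    poll_seconds: float = 60.0,
    timeout_seconds: float = 6 * 3600,
) -> str:
    """Poll `describe_job` until the job reaches a terminal status or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe_job()["AutoMLJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still in status {status!r} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Fake describe call that reports two in-progress polls, then completion.
_fake_responses = iter(
    [
        {"AutoMLJobStatus": "InProgress"},
        {"AutoMLJobStatus": "InProgress"},
        {"AutoMLJobStatus": "Completed"},
    ]
)
final_status = wait_for_automl_job(lambda: next(_fake_responses), poll_seconds=0)
print(final_status)
```

Against a real job, describe_job could be a lambda that wraps boto3.client("sagemaker").describe_auto_ml_job_v2(AutoMLJobName=job_name), where job_name is the name of your AutoML job. Retrieve the best candidate only when the returned status is Completed.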

3. Deploying a model with AutoMLV2

Deploying a machine learning model into production is a critical step in your machine learning workflow, enabling your applications to make predictions from new data. SageMaker AutoMLV2 not only helps build and tune your models but also provides a seamless deployment experience. In this section, we’ll guide you through deploying your best model from an AutoMLV2 job as a fully managed endpoint in SageMaker.

Step 1: Identify the best model and extract name

After your AutoMLV2 job completes, the first step in the deployment process is to identify the best performing model, also known as the best candidate. This can be achieved by using the best_candidate method of your AutoML job object. You can either use this method immediately after fitting the AutoML job or specify the job name explicitly if you’re operating on a previously completed AutoML job.

# Option 1: Directly after fitting the AutoML job
best_candidate = automl_sm_job.best_candidate()

# Option 2: Specifying the job name directly
best_candidate = automl_sm_job.best_candidate(job_name='your-auto-ml-job-name')

best_candidate_name = best_candidate['CandidateName']

Step 2: Create a SageMaker model

Before deploying, create a SageMaker model from the best candidate. This model acts as a container for the artifacts and metadata necessary to serve predictions. Use the create_model method of the AutoML job object to complete this step.

endpoint_name = f"ep-{best_candidate_name}-automl-ts"

# Create a SageMaker model from the best candidate
automl_sm_model = automl_sm_job.create_model(name=best_candidate_name, candidate=best_candidate)

4. Inference: Batch, real-time, and asynchronous

For deploying the trained model, we explore batch, real-time, and asynchronous inference methods to cater to different use cases.

The following figure is a decision tree to help you decide what type of endpoint to use. The diagram outlines a decision-making process for selecting between batch, asynchronous, or real-time inference endpoints. Starting with the need for immediate responses, it guides you through considerations like the size of the payload and the computational complexity of the model. Depending on these factors, you can choose a faster option with lower computational requirements or a slower batch process for larger datasets.

Decision tree for selecting between batch, asynchronous, or real-time inference endpoints

Batch inference using SageMaker pipelines

  • Usage: Ideal for generating forecasts in bulk, such as monthly sales predictions across all products and locations.
  • Process: We used SageMaker’s batch transform feature to process a large dataset of historical sales data, outputting forecasts for the specified horizon.

The inference pipeline used for batch inference demonstrates a comprehensive approach to deploying, evaluating, and conditionally registering a machine learning model for time series forecasting using SageMaker. This pipeline is structured to ensure a seamless flow from data preprocessing, through model inference, to post-inference evaluation and conditional model registration. Here’s a detailed breakdown of its construction:

  • Batch transform step
    • Transformer Initialization: A Transformer object is created, specifying the model to use for batch inference, the compute resources to allocate, and the output path for the results.
    • Transform step creation: This step invokes the transformer to perform batch inference on the specified input data. The step is configured to handle data in CSV format, a common choice for structured time series data.
  • Evaluation step
    • Processor setup: Initializes an SKLearn processor with the specified role, framework version, instance count, and type. This processor is used for the evaluation of the model’s performance.
    • Evaluation processing: Configures the processing step to use the SKLearn processor, taking the batch transform output and test data as inputs. The processing script (evaluation.py) is specified here, which will compute evaluation metrics based on the model’s predictions and the true labels.
    • Evaluation strategy: We adopted a comprehensive evaluation approach, using metrics like mean absolute error (MAE) and root-mean-squared error (RMSE) to quantify the model’s accuracy and adjusting the forecasting configuration based on these insights.
    • Outputs and property files: The evaluation step produces an output file (evaluation_metrics.json) that contains the computed metrics. This file is stored in Amazon S3 and registered as a property file for later access in the pipeline.
  • Conditional model registration
    • Model metrics setup: Defines the model metrics to be associated with the model package, including statistics and explainability reports sourced from specified Amazon S3 URIs.
    • Model registration: Prepares for model registration by specifying content types, inference and transform instance types, model package group name, approval status, and model metrics.
    • Conditional registration step: Implements a condition based on the evaluation metrics (for example, MAE). If the condition (for example, MAE is greater than or equal to threshold) is met, the model is registered; otherwise, the pipeline concludes without model registration.
  • Pipeline creation and runtime
    • Pipeline definition: Assembles the pipeline by naming it and specifying the sequence of steps to run: batch transform, evaluation, and conditional registration.
    • Pipeline upserting and runtime: The pipeline.upsert method is called to create or update the pipeline based on the provided definition, and pipeline.start() runs the pipeline.
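The evaluation step's output file can be pictured with the following sketch, which writes an evaluation_metrics.json report and reads one metric back by a dotted path, similar to how a pipeline condition looks up a value in a property file. The JSON layout and both helper functions are assumptions for illustration. The evaluation.py script in the repository may use a different structure.

```python
import json
import os
import tempfile


def write_evaluation_report(mae: float, rmse: float, output_dir: str) -> str:
    """Write the metrics to evaluation_metrics.json and return the file path."""
    # Hypothetical layout; align it with whatever your pipeline condition expects.
    report = {"regression_metrics": {"MAE": {"value": mae}, "RMSE": {"value": rmse}}}
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "evaluation_metrics.json")
    with open(path, "w") as handle:
        json.dump(report, handle, indent=2)
    return path


def read_metric(path: str, dotted_path: str) -> float:
    """Follow a dotted key path such as 'regression_metrics.MAE.value'."""
    with open(path) as handle:
        node = json.load(handle)
    for key in dotted_path.split("."):
        node = node[key]
    return node


report_path = write_evaluation_report(mae=1.5, rmse=1.87, output_dir=tempfile.mkdtemp())
mae_value = read_metric(report_path, "regression_metrics.MAE.value")
print(mae_value)
```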

The following figure is an example of the SageMaker Pipeline directed acyclic graph (DAG).

SageMaker Pipeline directed acyclic graph (DAG) for this problem.

This pipeline effectively integrates several stages of the machine learning lifecycle into a cohesive workflow, showcasing how Amazon SageMaker can be used to automate the process of model deployment, evaluation, and conditional registration based on performance metrics. By encapsulating these steps within a single pipeline, the approach enhances efficiency, ensures consistency in model evaluation, and streamlines the model registration process—all while maintaining the flexibility to adapt to different models and evaluation criteria.

Inferencing with Amazon SageMaker Endpoint in (near) real-time

But what if you want to run inference in real-time or asynchronously? SageMaker real-time endpoint inference offers the capability to deliver immediate predictions from deployed machine learning models, crucial for scenarios demanding quick decision making. When an application sends a request to a SageMaker real-time endpoint, it processes the data in real time and returns the prediction almost immediately. This setup is optimal for use cases that require near-instant responses, such as personalized content delivery, immediate fraud detection, and live anomaly detection.

  • Usage: Suited for on-demand forecasts, such as predicting next week’s sales for a specific product at a particular location.
  • Process: We deployed the model as a SageMaker endpoint, allowing us to make real-time predictions by sending requests with the required input data.

Deployment involves specifying the number of instances and the instance type to serve predictions. This step creates an HTTPS endpoint that your applications can invoke to perform real-time predictions.

# Deploy the model to a SageMaker endpoint
predictor = automl_sm_model.deploy(initial_instance_count=1, endpoint_name=endpoint_name, instance_type='ml.m5.xlarge')

The deployment process is asynchronous, and SageMaker takes care of provisioning the necessary infrastructure, deploying your model, and ensuring the endpoint’s availability and scalability. After the model is deployed, your applications can start sending prediction requests to the endpoint URL provided by SageMaker.

While real-time inference is suitable for many use cases, there are scenarios where a slightly relaxed latency requirement can be beneficial. SageMaker Asynchronous Inference provides a queue-based system that efficiently handles inference requests, scaling resources as needed to maintain performance. This approach is particularly useful for applications that require processing of larger datasets or complex models, where the immediate response is not as critical.

  • Usage: Examples include generating detailed reports from large datasets, performing complex calculations that require significant computational time, or processing high-resolution images or lengthy audio files. This flexibility makes it a complementary option to real-time inference, especially for businesses that face fluctuating demand and seek to maintain a balance between performance and cost.
  • Process: The process of using asynchronous inference is straightforward yet powerful. Users submit their inference requests to a queue, from which SageMaker processes them sequentially. This queue-based system allows SageMaker to efficiently manage and scale resources according to the current workload, ensuring that each inference request is handled as promptly as possible.

Clean up

To avoid incurring unnecessary charges and to tidy up resources after completing the experiments or running the demos described in this post, follow these steps to delete all deployed resources:

  1. Delete the SageMaker endpoints: To delete any deployed real-time or asynchronous endpoints, use the SageMaker console or the AWS SDK. This step is crucial as endpoints can accrue significant charges if left running.
  2. Delete the SageMaker Pipeline: If you have set up a SageMaker Pipeline, delete it to ensure that there are no residual executions that might incur costs.
  3. Delete S3 artifacts: Remove all artifacts stored in your S3 buckets that were used for training, storing model artifacts, or logging. Ensure you delete only the resources related to this project to avoid data loss.
  4. Clean up any additional resources: Depending on your specific implementation and additional setup modifications, there may be other resources to consider, such as roles or logs. Check your AWS Management Console for any resources that were created and delete them if they are no longer needed.

Conclusion

This post illustrates the effectiveness of Amazon SageMaker AutoMLV2 for time series forecasting. By carefully preparing the data, thoughtfully configuring the model, and using both batch and real-time inference, we demonstrated a robust methodology for predicting future sales. This approach not only saves time and resources but also empowers businesses to make data-driven decisions with confidence.

If you’re inspired by the possibilities of time series forecasting and want to experiment further, consider exploring the SageMaker Canvas UI. SageMaker Canvas provides a user-friendly interface that simplifies the process of building and deploying machine learning models, even if you don’t have extensive coding experience.

Visit the SageMaker Canvas page to learn more about its capabilities and how it can help you streamline your forecasting projects. Begin your journey towards more intuitive and accessible machine learning solutions today!


About the Authors

Nick McCarthy is a Senior Machine Learning Engineer at AWS, based in London. He has worked with AWS clients across various industries including healthcare, finance, sports, telecoms and energy to accelerate their business outcomes through the use of AI/ML. Outside of work he loves to spend time travelling, trying new cuisines and reading about science and technology. Nick has a bachelor’s degree in Astrophysics and a master’s degree in Machine Learning.

Davide Gallitelli is a Senior Specialist Solutions Architect for AI/ML in the EMEA region. He is based in Brussels and works closely with customers throughout Benelux. He has been a developer since he was very young, starting to code at the age of 7. He started learning AI/ML at university, and has fallen in love with it since then.

Automate user on-boarding for financial services with a digital assistant powered by Amazon Bedrock

In this post, we present a solution that harnesses the power of generative AI to streamline the user onboarding process for financial services through a digital assistant. Onboarding new customers in the banking industry is a crucial step in the customer journey, involving a series of activities designed to fulfill know your customer (KYC) requirements, conduct necessary verifications, and introduce them to the bank’s products or services. Traditionally, customer onboarding has been a tedious and heavily manual process. Our solution provides practical guidance on addressing this challenge by using a generative AI assistant on AWS.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Using Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock, we build a digital assistant that automates document processing and identity verification and engages customers through conversational interactions. As a result, customers can be onboarded in a matter of minutes through secure, automated workflows. In this post, we provide a solution and the accompanying code that banks can use to dramatically enhance the customer experience and establish a strong customer relationship from the outset.

Challenges with traditional onboarding

The traditional onboarding process for banks faces challenges in the current digital landscape because many institutions don’t have fully automated account-opening systems. While customers in other sectors have access to intelligent assistants, those in banking often encounter legacy processes. As the financial services industry adapts to changing consumer expectations, there’s a need to address the demand for instant and 24/7 availability of services.

The challenges associated with the manual onboarding process include, but aren’t limited to, the following:

  • Time-consuming paperwork – New customers are asked to manually fill out extensive paperwork including account opening forms, disclosures, and so on. Reviewing physical documents also takes up valuable staff time. This lengthy paperwork process can result in slow onboarding and a poor customer experience.
  • Security risks – Paper documents and in-person ID verification lack security compared to digital processes because of their susceptibility to tampering, loss, and lack of traceability. For example, there’s a greater risk of identity theft and fraud with physical documents, because they can be altered or misplaced without leaving an audit trail.
  • Accessibility issues – Requiring in-person account opening at branches can create accessibility challenges for many customers, including senior citizens and disabled individuals.
  • Limited service hours – The account opening process is available only during branch operating hours, which limits the timeframe when customers can complete the onboarding process. This constraint impacts the flexibility for customers to initiate account opening at their preferred time.
  • High costs – Manual paperwork processing and in-person verification are labor-intensive tasks that require significant staff time and resources, leading to high operational costs.

AI-powered services enable automated, secure, and compliant processes for self-service account opening. Providing onboarding experiences aligned with current digital standards might offer a competitive edge for banks in the future.

Solution overview

The solution allows users to open bank accounts remotely through a conversational interface, eliminating the need to visit a physical branch. We created a digital assistant named Penny to guide users through the process, including uploading KYC documents and facilitating identity verification using document scanning and facial recognition. The approach uses Retrieval Augmented Generation (RAG), which combines text generation capabilities with database querying to provide contextually relevant responses to customer inquiries. Implementing digital onboarding reduces the accessibility barriers present in traditional manual account opening processes. The code for this solution is available in a GitHub repository.

The brain of our application is a custom LangChain Agent. When a user wants to open a new bank account, the agent will help them complete the onboarding process using preconfigured stages corresponding to each onboarding step. Each stage might use a LangChain tool, allowing for the automation and orchestration of onboarding. These tools call on AWS service APIs for the required functionality.

The following figure represents the high-level architecture of the proposed solution.

User on-boarding architecture diagram

The flow of the application is as follows:

  1. Users access the frontend website hosted within AWS Amplify. AWS Amplify is an end-to-end solution that enables frontend web developers to build and deploy secure, scalable full stack applications.
  2. The website invokes an Amazon CloudFront endpoint to interact with the digital assistant, Penny, which is containerized and deployed in AWS Fargate. Fargate is a serverless compute engine for containers that manages and scales your containers for you, compatible with Amazon Elastic Container Service (Amazon ECS).
  3. The digital assistant uses a custom LangChain agent to answer questions on the bank’s products and services and orchestrate the onboarding flow.
  4. If the user asks a general question related to the bank’s products or services, the agent will use a custom LangChain tool called ProductSearch. This tool uses Amazon Kendra linked with an Amazon Simple Storage Service (Amazon S3) data source that contains the bank’s data. Amazon Kendra is an intelligent enterprise search service powered by machine learning that enables companies to index and search content across their document stores.
  5. If the user indicates that they want to open a new account, the agent will prompt the user for their email. After the user responds, the application will invoke a custom LangChain tool called EmailValidation. This tool checks whether there is an existing account in the bank’s Amazon DynamoDB database by calling an endpoint deployed in Amazon API Gateway.
  6. After the email validation, KYC information is gathered, such as first and last name. Then, the user is prompted for an identity document, which is uploaded to Amazon S3.
  7. The agent will invoke a custom LangChain tool called IDVerification. This tool checks if the user details entered during the session match the ID by calling an endpoint deployed in Amazon API Gateway. The details are verified by extracting the document text using Amazon Textract, a machine learning (ML) service that automatically extracts text, handwriting, layout elements, and data from scanned documents.
  8. After the ID verification, the user is asked for a selfie. The image is uploaded to Amazon S3. Then, the agent will invoke a custom LangChain tool called SelfieVerification. This tool checks if the uploaded selfie matches the face on the ID by calling an endpoint deployed in API Gateway. The face match is detected using Amazon Rekognition, which offers pre-trained and customizable computer vision (CV) capabilities to extract information and insights from your images and videos.
  9. After the face verification is successful, the agent will use a custom LangChain tool called SaveData. This tool creates a new account in the bank’s DynamoDB database by calling an endpoint deployed in API Gateway.
  10. The user is notified that their new account has been created successfully, using Amazon Simple Email Service (Amazon SES).
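
The name check in step 7 can be pictured with a small sketch. It assumes the text has already been extracted from the ID document (for example, by Amazon Textract) into a list of text lines. The helper name `names_match_document` and the normalization rules are illustrative assumptions, not the implementation in the repository.

```python
import re
import unicodedata
from typing import Iterable


def _normalize(text: str) -> str:
    """Strip accents, uppercase, and replace punctuation with spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^A-Z0-9]+", " ", without_accents.upper()).strip()


def names_match_document(first_name: str, last_name: str, extracted_lines: Iterable[str]) -> bool:
    """Return True if every word of both names appears in the extracted ID text."""
    words = {word for line in extracted_lines for word in _normalize(line).split()}
    for name in (first_name, last_name):
        tokens = _normalize(name).split()
        # An empty name can never be considered a match.
        if not tokens or not all(token in words for token in tokens):
            return False
    return True


# Hypothetical lines, standing in for text extracted from the sample passport.
sample_lines = ["PASSPORT", "Surname: RAVI", "Given names: ANUP"]
print(names_match_document("Anup", "Ravi", sample_lines))
```

A whole-word comparison like this is deliberately lenient about case, accents, and punctuation. A production check would likely compare specific fields of the document rather than any word on the page.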

Prompt design for agent orchestration

Now, let’s take a look at how we give our digital assistant, Penny, the capability to handle onboarding for financial services. The key is the prompt engineering for the custom LangChain agent. This has been specified in PennyAgent.py. This prompt includes onboarding stages and relevant LangChain tools that the agent might need to complete the onboarding steps.

To begin, we provide the agent with a name, role and company.

AGENT_TOOLS_PROMPT = """
Never forget your name is {assistant_name}. You work as a {assistant_role}.
You work at company named {bank_name}

Next, we define the various stages of onboarding and specify the respective tools and expected responses. Having stages in a sequential and structured format while also providing awareness of all possible stages helps the agent determine the onboarding stage with accuracy.

<STAGES>

These are the stages:

Introduction or greeting:  When conversation history is empty, choose stage 1
Response: Start the conversation with a greeting. Say that you can help with {bank_name} related questions or open a bank account for them. Do this only during the start of the conversation.
Tool: 
    
General Banking Questions: Customer asks general questions about AnyBank
Response: Use ProductSearch tool to get the relevant information and answer the question like a banking assistant. Never assume anything.
Tool: ProductSearch
    
Account Open 1: Customer has requested to open an account.
Response: Customer has requested to open an account. Now, respond with a question asking for the customer's email address only to get them started with onboarding. We need the email address to start the process.
Tool:
    
Account Open 2: User provided their email.
Response: Take the email and validate it using a EmailValidation tool. If it is valid and there is no existing account with the email, ask for account type: either CHEQUING or SAVINGS. If it is invalid or there is an existing account with the email, the user must try again. 
Tool: EmailValidation
    
Account Open 3: User provided which account type to open.
Response: Ask the user for their first name
Tool: 

Account Open 4: User provided first name.
Response: Ask the user for their last name
Tool: 

Account Open 5: User provided last name.
Response: Ask the user to upload an identity document.
Tool:
    
Account Open 6: Penny asked for identity document and then System notified that a new file has been uploaded
Response: Take the identity file name and verify it using the IDVerification tool. If the verification is unsuccessful, ask the user to try again. 
Tool: IDVerification
    
Account Open 7: The ID document is valid. 
Response: Ask the user to upload their selfie to compare their face to the ID.
Tool:
    
Account Open 8: Penny asked user for their selfie and then "System notified that a file has been uploaded. "
Response: Take the "selfie" file name and verify it using the SelfieVerification tool. If there is no face match, ask the user to try again.
Tool: SelfieVerification: Use this tool to verify the user selfie and compare faces. 
    
Account Open 9: Face match verified
Response: Give the summary of the all the information you collected and ask user to confirm. 
Tool:
        
Account Open 10: Confirmation
Response: Save the user data for future reference using SaveData tool. Upon saving the data, let the user know that they will receive an email confirmation of the bank account opening.
Tool: SaveData

We append the tools, their descriptions, and their response formats to the prompt. When calling on a specific tool, the agent can generate input parameters as required. Access to all the tools helps the agent identify the best tool choice based on the conversation stage.

TOOLS:
------
Penny has access to the following tools:
{tools}
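
As a rough sketch of how the `{tools}` placeholder (and the `{tool_names}` placeholder used in the format section of the prompt) might be filled, the following plain-Python example renders a list of tool names and descriptions into text. The `ToolSpec` class, the `render_tools` helper, and the description wording are assumptions made for illustration; the actual agent relies on LangChain tool objects.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical stand-in for a LangChain tool: just a name and a description."""
    name: str
    description: str


def render_tools(tools: List[ToolSpec]) -> Tuple[str, str]:
    """Build the text for the {tools} and {tool_names} prompt placeholders."""
    tools_block = "\n".join(f"{tool.name}: {tool.description}" for tool in tools)
    tool_names = ", ".join(tool.name for tool in tools)
    return tools_block, tool_names


# Tool names come from the post; the descriptions are paraphrased assumptions.
TOOLS = [
    ToolSpec("ProductSearch", "Answer general questions about the bank's products and services."),
    ToolSpec("EmailValidation", "Check whether an account already exists for an email address."),
    ToolSpec("IDVerification", "Verify the uploaded identity document against the user's details."),
    ToolSpec("SelfieVerification", "Verify the user selfie and compare faces."),
    ToolSpec("SaveData", "Create the new account with the collected details."),
]

tools_block, tool_names = render_tools(TOOLS)
print(tools_block)
print(tool_names)
```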

We include some guidelines that the agent needs to follow while generating outputs. By using emotion-based prompt engineering, we minimize hallucinations and deviation from expected outputs. These guidelines were chosen after extensive testing to minimize edge cases and help prevent common agent mistakes.

<GUIDELINES>

1. If you ever assume any user response without asking, it may cause significant consequences.
2. It is of high priority that you respond and use appropriate tools in their respective stages. If not, it may cause significant consequences.
3. It is of high priority that you never reveal the tools or tool names to the user. Only communicate the outcome.
4. It is critical that you never reveal any details provided by the System including file names. 
5. If ever the user deviates by asking general question during your account opening process, Retrieve the necessary information using 'ProductSearch' tool and answer the question. With confidence, ask user if they want to resume the account opening process and continue from where we left off. 

The agent uses the ReAct framework to make decisions about how to respond based on user input. ReAct provides the agent with a thinking structure, through which it selects the most appropriate tool for a given task. Such frameworks make LLM agents versatile and adaptable to different use cases.

Based on the stage descriptions and the tools available, if the LLM generates a response that requires access to an external tool, then the response of the LLM will include Thought, Decision, Action, Action Input, and Observation. The agent comes with a string matcher, which will detect Action and Action Input in the LLM’s response and trigger the respective tool. Based on the response from the tool, the LLM will decide whether to proceed with the Final Answer, and then the output will be returned by the agent.

FORMAT:
------

To use a tool, please always use the following format:
```
Thought: {input}
Decision: Do I need to use a tool? y
Action: what tool to use, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
```
When I am finished, I will have a response like this: 

Final Answer: [your response as a banking assistant]
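
The string matching described above can be sketched as follows. This is a minimal illustration under the assumption that the LLM follows the format shown; the regular expression and the `parse_llm_output` helper are hypothetical and are not taken from the repository or from LangChain.

```python
import re
from typing import Tuple, Union

# Matches an "Action:" line immediately followed by an "Action Input:" line.
ACTION_PATTERN = re.compile(
    r"Action:[ \t]*(?P<tool>[^\n]+)\nAction Input:[ \t]*(?P<tool_input>[^\n]*)"
)
FINAL_ANSWER_MARKER = "Final Answer:"


def parse_llm_output(text: str) -> Tuple[str, Union[str, Tuple[str, str]]]:
    """Return ("final", answer) or ("action", (tool_name, tool_input))."""
    if FINAL_ANSWER_MARKER in text:
        # Everything after the marker is the reply shown to the user.
        return "final", text.split(FINAL_ANSWER_MARKER, 1)[1].strip()
    match = ACTION_PATTERN.search(text)
    if match is None:
        raise ValueError("No action or final answer found in the LLM output")
    return "action", (match.group("tool").strip(), match.group("tool_input").strip())


sample = (
    "Thought: The user provided their email\n"
    "Decision: Do I need to use a tool? y\n"
    "Action: EmailValidation\n"
    "Action Input: anup@test.com\n"
)
kind, payload = parse_llm_output(sample)
print(kind, payload)
```

When the result is an action, the agent would call the named tool with the parsed input and feed the tool's output back to the LLM as the Observation.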

Finally, we give the agent access to the conversation history to better decide what stage the conversation is currently in. In addition, we also give access to an agent scratchpad where it can store its thought processes to execute certain actions.

Be confident that you are a banking assistant and only respond with final answer.
Begin!

<Conversation history>
{conversation_history}

{agent_scratchpad}
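
To make the last two placeholders concrete, here is a small sketch of how conversation turns and intermediate tool steps could be rendered into text. The tuple layouts, the helper names, and the `VALID` observation are assumptions for illustration only.

```python
from typing import List, Tuple

# Tail of the prompt as shown above.
PROMPT_TAIL = "<Conversation history>\n{conversation_history}\n\n{agent_scratchpad}"


def format_history(turns: List[Tuple[str, str]]) -> str:
    """Render (speaker, message) pairs as one line per turn."""
    return "\n".join(f"{speaker}: {message}" for speaker, message in turns)


def format_scratchpad(steps: List[Tuple[str, str, str]]) -> str:
    """Render (tool, tool_input, observation) steps in the prompt's tool format."""
    return "".join(
        f"Action: {tool}\nAction Input: {tool_input}\nObservation: {observation}\n"
        for tool, tool_input, observation in steps
    )


history = [
    ("User", "I would like to open a new bank account"),
    ("Penny", "Sure! What is your email address?"),
    ("User", "anup@test.com"),
]
steps = [("EmailValidation", "anup@test.com", "VALID")]

prompt_tail = PROMPT_TAIL.format(
    conversation_history=format_history(history),
    agent_scratchpad=format_scratchpad(steps),
)
print(prompt_tail)
```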

Orchestrating intelligent digital assistants requires thoughtful prompt engineering to handle complex tasks. By structuring the conversation into stages, providing tooling, and setting guidelines, we enable the assistant to systematically complete the onboarding process. This approach allows assistants to scale across use cases while maintaining accuracy. With the right guardrails, assistants can deliver smooth, trustworthy customer experiences.

Prompt design is key to unlocking the versatility of LLMs for real-world automation. Amazon Bedrock Prompt Management can be used to streamline the creation, evaluation, versioning and testing of prompts. This will help developers and prompt engineers save time by applying the same prompt to different onboarding processes. When you create a prompt, you can select a different model for inference and adjust the variables to obtain the best-suited results for a variety of workflows.

The following sections explain how to deploy the solution in your AWS account.

Note: Running this workload would have an estimated hourly cost of $1.34 for the Oregon (us-west-2) AWS Region. Check the pricing details for each service to understand the costs you might be charged for different usage tiers and resource configurations.
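
As a quick back-of-the-envelope projection, the hourly estimate can be scaled to longer periods. The sketch below assumes the workload runs continuously at the quoted rate and uses a 30-day month; actual charges depend on usage and configuration.

```python
HOURLY_COST_USD = 1.34  # estimate quoted above for us-west-2


def projected_cost(hours: float, hourly_cost: float = HOURLY_COST_USD) -> float:
    """Project the cost of leaving the workload running for a number of hours."""
    return round(hours * hourly_cost, 2)


daily = projected_cost(24)
monthly = projected_cost(24 * 30)  # assumes a 30-day month
print(f"Daily: ${daily:.2f}, monthly: ${monthly:.2f}")
```

At this rate, the workload costs about $32 per day and roughly $965 for a 30-day month, which is a good reason to follow the cleanup steps after testing.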

Setup

To deploy the agent, visit the project GitHub repository, and use the following instructions:

  1. Ensure the prerequisites mentioned in the README are completed.
  2. Deploy the solution including the agent, tools infrastructure, and demo application—in that order—based on the instructions in the README.
  3. After the deployment is successful, visit the outputted domain where the demo application is running. You can now begin testing the agent.

Testing the agent

Begin your exploration by accessing the Amplify endpoint where the demonstration is hosted. The demonstration incorporates an interactive chat interface, enabling you to engage in a conversational exchange with the digital assistant, Penny. Whenever you want to initiate a new instance of the agent, refresh the web page.

Let’s start talking to Penny:

  1. Enter Hi

Penny will respond with a friendly greeting.

  2. Enter What are the cutoff times to receive wire transfers on the same day?

Penny will use the ProductSearch tool to find the relevant information from the loaded product catalog. You can try asking other questions about the bank’s products or services, including the AnyBank Travel Rewards Visa Infinite Card or New Vehicle Loans.

  3. Enter I would like to open a new bank account

Penny will recognize that the account opening flow needs to be initiated and will proceed with the first step, which is asking you for an email address.

Open bank account

  4. Enter the verified customer email you registered with the Amazon SES identity. For our demonstration, we will use anup@test.com (the SesCustomerEmail parameter used in the example command to set up the infrastructure).

Penny will take the email address and run the EmailValidation tool. If there is an existing account with this email, it will ask you to retry. Otherwise, it will move on to the next step, which is gathering your account type.

  5. Enter I want a savings account or indicate that you want a checking account.

Penny will record your account type and move on to the KYC questions.

  6. Enter Anup

Penny will record your first name and continue gathering the remaining KYC information.

KYC information

  7. Enter Ravi

It will record your last name and prompt you for an ID next. We used Ravi to match the ID document provided below.

  8. Download the picture ID. It’s also located at ./api/lambdas/test/passport.png

Sample passport

Upload it to the chat by selecting Choose File.

After uploading the image, you will receive a confirmation message on the chat stating We have received your document. Penny will use ID Verification to compare the name entered during the session to the document. After verification is complete, Penny will prompt you to upload a selfie.

  9. Upload the selfie located at ./api/lambdas/test/selfie.png to the chat by selecting Choose File.

Sample selfie

After the upload is complete, you will receive a confirmation message on the chat stating We have received your document. Penny will use Selfie Verification to compare the face on the ID to the selfie for a face match. After verification is complete, Penny will prompt you to confirm that you want to proceed.

ID verification

  10. Enter Yes I confirm

Confirmation email

Penny will use Create Account to complete the onboarding process and send an email confirmation. It will inform you of this update in the chat.

New account creation

Check the customer email you used. The email address specified as the SesCustomerEmail parameter (in this example: anup@test.com) during setup will receive a new email from the email address you set as the SesBankEmail parameter (in this example: owner@anybank.com).

  11. Go to the DynamoDB console, select Tables from the navigation pane, and select the table created by the AWS CloudFormation stack. This is the accounts table in the bank’s AWS account. From the table page, choose Explore items. You will see a new account created with the details that you entered.

Account creation DynamoDB

Guardrails and security

Security is a critical part of any application and must be rigorously addressed when developing and deploying solutions, especially those that involve handling sensitive data or interacting with users. For a solution similar to the example in this post, several robust security measures should be implemented to maintain the confidentiality, integrity, and availability of the system.

  • Address the security of the service itself. One approach to mitigate potential biases, toxicity, or other undesirable outputs is to use Constitutional AI techniques, such as those provided by the LangChain library or Guardrails for Amazon Bedrock. By defining and enforcing a set of rules or constraints, the system can be trained to generate outputs that align with predefined ethical principles and values, thereby enhancing the trustworthiness and reliability of the service.
  • To maintain data protection and privacy, implementing a write-only database architecture is recommended. In this setup, the agent or service can write data to the database but is prohibited from reading or retrieving sensitive stored information. This measure effectively isolates sensitive user data, making sure that the agent would be unable to access or disclose confidential details even in the event of a compromise.
  • Prompt injection attacks, where malicious inputs are crafted to manipulate the system’s behavior, are a serious concern in conversational AI systems. To mitigate this risk, it’s crucial to implement robust input validation and sanitization mechanisms. This could include techniques like whitelisting permissible characters, filtering out potentially harmful patterns, and employing context-aware input processing.
  • Secure coding practices, such as input validation, output encoding, and proper error handling, should be rigorously followed throughout the development process. Regular security audits, penetration testing, and vulnerability assessments should be conducted to identify and address potential weaknesses in the system.
  • Amazon API Gateway, a fully managed service, securely handles API traffic, acting as a front door for applications running on AWS. It supports multiple security mechanisms, including AWS Identity and Access Management (IAM) for authentication and authorization, AWS WAF for web application protection, AWS Secrets Manager for securely storing and retrieving secrets, and integration with AWS CloudTrail for API activity logging. API Gateway also supports client-side SSL certificates, API keys, and resource policies for granular access control.
  • Communication between users, the solution, and its internal dependencies should be protected using TLS to encrypt data in transit.
  • Additionally, the data should be encrypted using data-at-rest encryption with AWS Key Management Service (AWS KMS) customer managed keys (CMK).

By implementing these robust security measures and fostering a culture of continuous security awareness and improvement, the solution can better protect against potential threats, safeguard user privacy, and maintain the integrity and reliability of the service.
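
The input validation point from the list above can be illustrated with a minimal sketch that combines a length cap, a character allowlist, and a few pattern checks. The limit, the allowlist, and the patterns are assumptions chosen for this example. Filtering of this kind is only one layer and does not by itself prevent prompt injection.

```python
import re
import unicodedata

MAX_INPUT_LENGTH = 500  # assumed cap for a single chat message

# Characters outside this allowlist are stripped (letters, digits, whitespace, common punctuation).
DISALLOWED_CHARS = re.compile(r"[^\w\s@.,;:!?'()\-+/$%&]")

# Hypothetical patterns that suggest an attempt to override the agent's instructions.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.IGNORECASE),
    re.compile(r"\bsystem\s+notified\b", re.IGNORECASE),  # spoofing the file upload notice
    re.compile(r"</?\s*(stages|guidelines|conversation history)\s*>", re.IGNORECASE),
]


def sanitize_user_input(raw: str) -> str:
    """Return a cleaned message, or raise ValueError if the input is rejected."""
    text = unicodedata.normalize("NFKC", raw).strip()
    if not text:
        raise ValueError("Empty input")
    if len(text) > MAX_INPUT_LENGTH:
        raise ValueError("Input too long")
    # Check patterns before stripping characters so markup such as <STAGES> is still visible.
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(text):
            raise ValueError("Input rejected by pattern filter")
    return DISALLOWED_CHARS.sub("", text)


print(sanitize_user_input("  I want a savings account  "))
```

Rejected inputs could be answered with a generic request to rephrase, so that the agent never sees the suspicious text.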

Cleanup

Follow the cleanup instructions in the README of the GitHub repository to remove the environment from your account.

Conclusion

In this post, we presented an end-to-end solution that demonstrates how banks can transform user onboarding with an AI-powered digital assistant. By orchestrating workflows across AWS services, we enabled automated, secure account opening within minutes. The conversational interface delivers exceptional customer experiences while reducing operational costs.

This solution can be quickly deployed and enhanced using the features of Amazon Bedrock. Amazon Bedrock Agents streamlines workflows by executing multistep tasks and integrating with company systems and data sources. Amazon Bedrock Knowledge Bases provides contextual information from proprietary data sources, enhancing the accuracy and relevance of responses. Additionally, Amazon Bedrock Guardrails implements safeguards to enable responsible AI usage, filtering harmful content and protecting sensitive information. These can enable a robust and secure deployment of an AI-powered onboarding solution.

Key outcomes of this solution include:

  • Fully digital onboarding without paper forms or branch visits
  • Automated KYC verification using documents and facial recognition
  • Customers onboarded securely in minutes with email confirmations
  • Lower costs by reducing manual verification workloads
  • Personalized assistance for any product questions 24/7

Instant, secure, and scalable delivery has become the norm that customers demand. This AI assistant solution, powered by AWS, showcases the potential future of user onboarding for financial institutions. As consumer behaviors and expectations continue to be influenced by the latest digital experiences across industries, banks that invest in advanced technologies will gain a competitive edge over their rivals.

Ready to future proof your banking experience? Visit Artificial Intelligence and Machine learning for Financial services with AWS.


About the authors

Anup Ravindranath is a Senior Solutions Architect at Amazon Web Services (AWS) based in Toronto, Canada working with Financial Services organizations. He helps customers to transform their businesses and innovate on cloud.

Arya Subramanyam is a Solutions Architect based in Toronto, Canada. She works with Enterprise Greenfield customers as well as Small & Medium businesses as a technical advisor, helping them solve business challenges with cloud solutions. Arya holds a Bachelor of Applied Science in Computer Engineering from the University of British Columbia, Vancouver. Her passion for Generative AI has led her to develop various solutions leveraging Large Language Models (LLMs) with a focus on prompt engineering and AI agents.

Venkata Satyanarayana Chivatam is a Solutions Architect at AWS. He specializes in Generative AI and Computer Vision, with a particular focus on driving adoption across industries such as healthcare and finance. At AWS, he helps ISV and SMB customers leverage cutting-edge AI technologies to unlock new possibilities and solve complex challenges. He is passionate about supporting businesses of all sizes in their AI journey.

Akshata Ramesh Rao is a Solutions Architect in Toronto, Canada. Akshata works with enterprise customers to accelerate innovation and advise them through technical challenges. She also loves working with SMB customers and helping them reach their business objectives quickly, safely, and cost-effectively with AWS services, frameworks, and best practices. Prior to joining AWS, Akshata worked as a DevOps Engineer at Amazon. She holds a master’s degree in computer science from the University of Ottawa.

Build a generative AI Slack chat assistant using Amazon Bedrock and Amazon Kendra

Despite the proliferation of information and data in business environments, employees and stakeholders often find themselves searching for information and struggling to get their questions answered quickly and efficiently. This can lead to productivity losses, frustration, and delays in decision-making.

A generative AI Slack chat assistant can help address these challenges by providing a readily available, intelligent interface for users to interact with and obtain the information they need. By using the natural language processing and generation capabilities of generative AI, the chat assistant can understand user queries, retrieve relevant information from various data sources, and provide tailored, contextual responses.

By harnessing the power of generative AI and the Amazon Web Services (AWS) services Amazon Bedrock, Amazon Kendra, and Amazon Lex, this solution provides a sample architecture to build an intelligent Slack chat assistant that can streamline information access, enhance user experiences, and drive productivity and efficiency within organizations.

Why use Amazon Kendra for building a RAG application?

Amazon Kendra is a fully managed service that provides out-of-the-box semantic search capabilities for state-of-the-art ranking of documents and passages. You can use Amazon Kendra to quickly build high-accuracy generative AI applications on enterprise data and source the most relevant content and documents to maximize the quality of your Retrieval Augmented Generation (RAG) payload, yielding better large language model (LLM) responses than using conventional or keyword-based search solutions. Amazon Kendra offers simple-to-use deep learning search models that are pre-trained on 14 domains and don’t require machine learning (ML) expertise. Amazon Kendra can index content from a wide range of sources, including databases, content management systems, file shares, and web pages.

Further, the FAQ feature in Amazon Kendra complements the broader retrieval capabilities of the service, allowing the RAG system to seamlessly switch between providing prewritten FAQ responses and dynamically generating responses by querying the larger knowledge base. This makes it well-suited for powering the retrieval component of a RAG system, allowing the model to access a broad knowledge base when generating responses. By integrating the FAQ capabilities of Amazon Kendra into a RAG system, the model can use a curated set of high-quality, authoritative answers for commonly asked questions. This can improve the overall response quality and user experience, while also reducing the burden on the language model to generate these basic responses from scratch.

This solution balances retaining customizations in terms of model selection, prompt engineering, and adding FAQs with not having to deal with word embeddings, document chunking, and other lower-level complexities typically required for RAG implementations.

Solution overview

The chat assistant is designed to assist users by answering their questions and providing information on a variety of topics. The purpose of the chat assistant is to be an internal-facing Slack tool that can help employees and stakeholders find the information they need.

The architecture uses Amazon Lex for intent recognition, AWS Lambda for processing queries, Amazon Kendra for searching through FAQs and web content, and Amazon Bedrock for generating contextual responses powered by LLMs. By combining these services, the chat assistant can understand natural language queries, retrieve relevant information from multiple data sources, and provide humanlike responses tailored to the user’s needs. The solution showcases the power of generative AI in creating intelligent virtual assistants that can streamline workflows and enhance user experiences based on model choices, FAQs, and modifying system prompts and inference parameters.

Architecture diagram

The following diagram illustrates a RAG approach where the user sends a query through the Slack application and receives a generated response based on the data indexed in Amazon Kendra. In this post, we use Amazon Kendra Web Crawler as the data source and include FAQs stored on Amazon Simple Storage Service (Amazon S3). See Data source connectors for a list of supported data source connectors for Amazon Kendra.

The step-by-step workflow for the architecture is the following:

  1. The user sends a query such as What is the AWS Well-Architected Framework? through the Slack app.
  2. The query goes to Amazon Lex, which identifies the intent.
  3. Currently two intents are configured in Amazon Lex (Welcome and FallbackIntent).
  4. The welcome intent is configured to respond with a greeting when a user enters a greeting such as “hi” or “hello.” The assistant responds with “Hello! I can help you with queries based on the documents provided. Ask me a question.”
  5. The fallback intent is fulfilled with a Lambda function.
    1. The Lambda function searches Amazon Kendra FAQs through the search_Kendra_FAQ method by taking the user query and Amazon Kendra index ID as inputs. If there’s a match with a high confidence score, the answer from the FAQ is returned to the user.
      import boto3

      def search_Kendra_FAQ(question, kendra_index_id):
          """
          This function takes in the question from the user, and checks if the question exists in the Kendra FAQs.
          :param question: The question the user is asking that was asked via the frontend input text box.
          :param kendra_index_id: The kendra index containing the documents and FAQs
          :return: If found in FAQs, returns the answer along with any relevant link. If not, returns (False, False) so that the caller can fall back to the kendra_retrieve_document function.
          """
          kendra_client = boto3.client('kendra')
          response = kendra_client.query(IndexId=kendra_index_id, QueryText=question, QueryResultTypeFilter='QUESTION_ANSWER')
          for item in response['ResultItems']:
              score_confidence = item['ScoreAttributes']['ScoreConfidence']
              # Taking answers from FAQs that have a very high confidence score only
              if score_confidence == 'VERY_HIGH' and len(item['AdditionalAttributes']) > 1:
                  text = item['AdditionalAttributes'][1]['Value']['TextWithHighlightsValue']['Text']
                  url = "None"
                  if item['DocumentURI'] != '':
                      url = item['DocumentURI']
                  return (text, url)
          return (False, False)

    2. If there isn’t a match with a high enough confidence score, relevant documents from Amazon Kendra with a high confidence score are retrieved through the kendra_retrieve_document method and sent to Amazon Bedrock as context for generating a response.
      def kendra_retrieve_document(question, kendra_index_id):
          """
          This function takes in the question from the user, and retrieves relevant passages based on default PageSize of 10.
          :param question: The question the user is asking that was asked via the frontend input text box.
          :param kendra_index_id: The kendra index containing the documents and FAQs
          :return: Returns the context to be sent to the LLM and document URIs to be returned as relevant data sources.
          """
          kendra_client = boto3.client('kendra')
          documents = kendra_client.retrieve(IndexId=kendra_index_id, QueryText=question)
          text = ""
          uris = set()
          if len(documents['ResultItems']) > 0:
              for i in range(len(documents['ResultItems'])):
                  score_confidence = documents['ResultItems'][i]['ScoreAttributes']['ScoreConfidence']
                  if score_confidence == 'VERY_HIGH' or score_confidence == 'HIGH':
                      text += documents['ResultItems'][i]['Content'] + "\n"
                      uris.add(documents['ResultItems'][i]['DocumentURI'])
          return (text, uris)

    3. The response is generated from Amazon Bedrock with the invokeLLM method. The following is a snippet of the invokeLLM method within the fulfillment function. Read more on inference parameters and system prompts to modify parameters that are passed into the Amazon Bedrock invoke model request.
      def invokeLLM(question, context, modelId):
          """
          This function takes in the question from the user, along with the Kendra responses as context to generate an answer
          for the user on the frontend.
          :param question: The question the user is asking that was asked via the frontend input text box.
          :param context: The passages returned by the Kendra document retrieve query, used as context to generate a better
          answer.
          :param modelId: The ID of the Amazon Bedrock model to invoke.
          :return: Returns the content of the model response, which is provided to the end user of the application who asked the
          original question.
          """
          # Set up the Bedrock runtime client
          bedrock = boto3.client('bedrock-runtime')

          # Body of data with the parameters that are passed into the Bedrock invoke model request
          body = json.dumps({"max_tokens": 350,
                  "system": "You are a truthful AI assistant. Your goal is to provide informative and substantive responses to queries based on the documents provided. If you do not know the answer to a question, you truthfully say you do not know.",
                  "messages": [{"role": "user", "content": "Answer this user query: " + question + " with the following context: " + context}],
                  "anthropic_version": "bedrock-2023-05-31",
                  "temperature": 0,
                  "top_k": 250,
                  "top_p": 0.999})

          # Invoke the Bedrock model with your specifications
          response = bedrock.invoke_model(body=body,
                                          modelId=modelId)
          # The body of the response that was generated
          response_body = json.loads(response.get('body').read())
          # Retrieve the content field of the response, where your answer will be
          answer = response_body.get('content')
          # Return the answer as the final result, which ultimately gets returned to the end user
          return answer

    4. Finally, the response generated from Amazon Bedrock along with the relevant referenced URLs are returned to the end user.
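Taken together, steps 1 to 4 amount to a FAQ-first, retrieve-then-generate flow. The following is a minimal sketch of that flow, not the deployed fulfillment function. The name `answer_question` is hypothetical, and the three helpers are passed in as parameters so that the sketch runs without AWS credentials. It assumes the helpers return the same tuple shapes as the snippets above and that the fallback text is the "No relevant documents found" message this post describes.

```python
from typing import Callable, Tuple

NO_DOCS_MESSAGE = "No relevant documents found"


def answer_question(
    question: str,
    kendra_index_id: str,
    model_id: str,
    search_faq: Callable,
    retrieve_documents: Callable,
    invoke_llm: Callable,
) -> Tuple[object, list]:
    """Return (answer, source_urls) using the FAQ-first flow (hypothetical helper)."""
    # Step 1: a FAQ match short-circuits everything else.
    faq_text, faq_url = search_faq(question, kendra_index_id)
    if faq_text:
        urls = [faq_url] if faq_url and faq_url != "None" else []
        return faq_text, urls

    # Step 2: otherwise gather the retrieved passages as context.
    context, uris = retrieve_documents(question, kendra_index_id)
    if not context:
        return NO_DOCS_MESSAGE, []

    # Step 3: generate an answer grounded in the retrieved passages.
    answer = invoke_llm(question, context, model_id)

    # Step 4: return the answer together with the referenced URLs.
    return answer, sorted(uris)
```

In the Lambda function, `search_faq`, `retrieve_documents`, and `invoke_llm` would be the `search_Kendra_FAQ`, `kendra_retrieve_document`, and `invokeLLM` functions shown above.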

    When selecting websites to index, adhere to the AWS Acceptable Use Policy and other AWS terms. Remember that you can only use Amazon Kendra Web Crawler to index your own web pages or web pages that you have authorization to index. Visit the Amazon Kendra Web Crawler data source guide to learn more about using the web crawler as a data source. Using Amazon Kendra Web Crawler to aggressively crawl websites or web pages you don’t own is not considered acceptable use.

    Supported features

    The chat assistant supports the following features:

    1. Support for the following Anthropic models on Amazon Bedrock:
      • claude-v2
      • claude-3-haiku-20240307-v1:0
      • claude-instant-v1
      • claude-3-sonnet-20240229-v1:0
    2. Support for FAQs and the Amazon Kendra Web Crawler data source
    3. Returns FAQ answers only if the confidence score is VERY_HIGH
    4. Retrieves only documents from Amazon Kendra that have a HIGH or VERY_HIGH confidence score
    5. If documents with a high confidence score aren’t found, the chat assistant returns “No relevant documents found”
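The model names in the list above are shown without a provider prefix, whereas Amazon Bedrock model IDs for Anthropic models start with `anthropic.`. The following sketch shows one way to validate a short name and expand it before calling `invoke_model`. The helper name `to_bedrock_model_id` is hypothetical, and whether the CloudFormation template expects the short or the prefixed form is an assumption you should check against the template.

```python
SUPPORTED_MODELS = (
    "claude-v2",
    "claude-3-haiku-20240307-v1:0",
    "claude-instant-v1",
    "claude-3-sonnet-20240229-v1:0",
)


def to_bedrock_model_id(short_name: str) -> str:
    """Map a short model name from the supported list to a Bedrock model ID."""
    if short_name not in SUPPORTED_MODELS:
        raise ValueError(f"Unsupported model: {short_name!r}")
    # Assumption: the caller needs the standard 'anthropic.' provider prefix.
    return f"anthropic.{short_name}"
```

Rejecting unknown names early keeps a typo from surfacing later as a validation error in the Amazon Bedrock request.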

    Prerequisites

    To deploy the solution, you need the following prerequisites:

    • Basic knowledge of AWS
    • An AWS account with access to Amazon S3 and Amazon Kendra
    • An S3 bucket to store your documents. For more information, see Step 1: Create your first S3 bucket and the Amazon S3 User Guide.
    • A Slack workspace to integrate the chat assistant
    • Permission to install Slack apps in your Slack workspace
    • Seed URLs for the Amazon Kendra Web Crawler data source
      • You’ll need authorization to crawl and index any websites provided
    • AWS CloudFormation for deploying the solution resources

    Build a generative AI Slack chat assistant

    To build a Slack application, use the following steps:

    1. Request model access on Amazon Bedrock for all Anthropic models
    2. Create an S3 bucket in the us-east-1 (N. Virginia) AWS Region.
    3. Upload the AIBot-LexJson.zip and SampleFAQ.csv files to the S3 bucket
    4. Launch the CloudFormation stack in the us-east-1 (N. Virginia) AWS Region to create the solution resources
    5. Enter a Stack name of your choice
    6. For S3BucketName, enter the name of the S3 bucket created in Step 2
    7. For S3KendraFAQKey, enter the name of the SampleFAQs uploaded to the S3 bucket in Step 3
    8. For S3LexBotKey, enter the name of the Amazon Lex .zip file uploaded to the S3 bucket in Step 3
    9. For SeedUrls, enter up to 10 URLs for the web crawler as a comma delimited list. In the example in this post, we give the publicly available Amazon Bedrock service page as the seed URL
    10. Leave the rest as defaults. Choose Next. Choose Next again on the Configure stack options page
    11. Acknowledge by selecting the box and choose Submit, as shown in the following screenshot
      ML-16837-cfn-checkbox
    12. Wait for the stack creation to complete
    13. Verify all resources are created
    14. Test on the AWS Management Console for Amazon Lex
      1. On the Amazon Lex console, choose your chat assistant ${YourStackName}-AIBot
      2. Choose Intents
      3. Choose Version 1 and choose Test, as shown in the following screenshot
        ML-16837-lex-version1
      4. Select the AIBotProdAlias and choose Confirm, as shown in the following screenshot. If you want to make changes to the chat assistant, you can use the draft version, publish a new version, and assign the new version to the AIBotProdAlias. Learn more about Versioning and Aliases.
      5. Test the chat assistant with questions such as, “Which AWS service has 11 nines of durability?” and “What is the AWS Well-Architected Framework?” and verify the responses. The following table shows that there are three FAQs in the sample .csv file.
        _question | _answer | _source_uri
        Which AWS service has 11 nines of durability? | Amazon S3 | https://aws.amazon.com/s3/
        What is the AWS Well-Architected Framework? | The AWS Well-Architected Framework enables customers and partners to review their architectures using a consistent approach and provides guidance to improve designs over time. | https://aws.amazon.com/architecture/well-architected/
        In what Regions is Amazon Kendra available? | Amazon Kendra is currently available in the following AWS Regions: Northern Virginia, Oregon, and Ireland | https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/
      6. The following screenshot shows the question “Which AWS service has 11 nines of durability?” and its response. You can observe that the response is the same as in the FAQ file and includes a link.
        ML-16837-Q1inLex
      7. Based on the pages you have crawled, ask a question in the chat. For this example, the publicly available Amazon Bedrock page was crawled and indexed. The following screenshot shows the question, “What are agents in Amazon Bedrock?” and a generated response that includes relevant links.
        ML-16837-Q2inLex
    15. To integrate the Amazon Lex chat assistant with Slack, see Integrating an Amazon Lex V2 bot with Slack. Choose the AIBotProdAlias under Alias in Channel Integrations.
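The SeedUrls parameter in step 9 takes up to 10 URLs as a comma-delimited list. As a minimal sketch, the following helper checks a list of URLs and builds that string before you paste it into the CloudFormation console. The function name `build_seed_urls_parameter` is hypothetical, and the rule that each seed must be an absolute http or https URL is an assumption rather than a documented template constraint.

```python
from urllib.parse import urlparse

MAX_SEED_URLS = 10  # limit stated in step 9


def build_seed_urls_parameter(urls):
    """Validate seed URLs and join them into a comma-delimited string."""
    cleaned = [u.strip() for u in urls if u and u.strip()]
    if not cleaned:
        raise ValueError("At least one seed URL is required")
    if len(cleaned) > MAX_SEED_URLS:
        raise ValueError(f"Expected at most {MAX_SEED_URLS} URLs, got {len(cleaned)}")
    for url in cleaned:
        parsed = urlparse(url)
        # Assumption: only absolute http(s) URLs are useful as seeds.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Not an absolute http(s) URL: {url!r}")
    return ",".join(cleaned)
```

For example, `build_seed_urls_parameter(["https://aws.amazon.com/bedrock/"])` returns the single URL unchanged, which matches the seed used in this post.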

    Run sample queries to test the solution

    1. In Slack, go to the Apps section. In the dropdown menu, choose Manage and select Browse apps.
      ML-16837-slackBrowseApps
    2. Search for ${AIBot} in App Directory and choose the chat assistant. This will add the chat assistant to the Apps section in Slack. You can now start asking questions in the chat. The following screenshot shows the question “Which AWS service has 11 nines of durability?” and its response. You can observe that the response is the same as in the FAQ file and includes a link.
      ML-16837-Q1slack
    3. The following screenshot shows the question, “What is the AWS Well-Architected Framework?” and its response.
      ML-16837-Q2slack
    4. Based on the pages you have crawled, ask a question in the chat. For this example, the publicly available Amazon Bedrock page was crawled and indexed. The following screenshot shows the question, “What are agents in Amazon Bedrock?” and a generated response that includes relevant links.
      ML-16837-Q3slack
    5. The following screenshot shows the question, “What is amazon polly?” Because there is no Amazon Polly documentation indexed, the chat assistant responds with “No relevant documents found,” as expected.
      ML-16837-Q4slack

    These examples show how the chat assistant retrieves documents from Amazon Kendra and provides answers based on the documents retrieved. If no relevant documents are found, the chat assistant responds with “No relevant documents found.”
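If you prefer to run the same sample queries from a script instead of Slack or the Amazon Lex console, the Lex V2 runtime API can send text to the bot directly. The following is a minimal sketch under stated assumptions: the helper name `ask_bot` is hypothetical, the bot and alias IDs are placeholders you look up for your own stack, and the `en_US` locale is an assumption about how the bot was built.

```python
import uuid


def ask_bot(lex_client, bot_id, bot_alias_id, text, locale_id="en_US", session_id=None):
    """Send one utterance to a Lex V2 bot and return the reply texts."""
    response = lex_client.recognize_text(
        botId=bot_id,
        botAliasId=bot_alias_id,
        localeId=locale_id,
        # A fresh session ID keeps each test question independent.
        sessionId=session_id or uuid.uuid4().hex,
        text=text,
    )
    # Keep only the messages that carry text content.
    return [m["content"] for m in response.get("messages", []) if "content" in m]


# Example usage (requires AWS credentials and real IDs):
#   import boto3
#   lex = boto3.client("lexv2-runtime", region_name="us-east-1")
#   print(ask_bot(lex, "<bot-id>", "<bot-alias-id>", "What is amazon polly?"))
```

Passing the client in as a parameter keeps the helper easy to exercise with a stub before pointing it at the deployed bot.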

    Clean up

    To clean up the resources created by this solution:

    1. Delete the CloudFormation stack by navigating to the CloudFormation console
    2. Select the stack you created for this solution and choose Delete
    3. Confirm the deletion by entering the stack name in the provided field. This will remove all the resources created by the CloudFormation template, including the Amazon Kendra index, Amazon Lex chat assistant, Lambda function, and other related resources.
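The same cleanup can be scripted. The following sketch uses the boto3 CloudFormation client's `delete_stack` call and its `stack_delete_complete` waiter. The helper name `delete_stack_and_wait` is hypothetical, and the stack name is whatever you entered when launching the stack. The S3 bucket from the build steps was created outside the stack, so this sketch doesn't touch it.

```python
def delete_stack_and_wait(cfn_client, stack_name):
    """Delete a CloudFormation stack and block until the deletion completes."""
    cfn_client.delete_stack(StackName=stack_name)
    # The waiter polls the stack status until it reports DELETE_COMPLETE.
    waiter = cfn_client.get_waiter("stack_delete_complete")
    waiter.wait(StackName=stack_name)


# Example usage (requires AWS credentials):
#   import boto3
#   cfn = boto3.client("cloudformation", region_name="us-east-1")
#   delete_stack_and_wait(cfn, "<your-stack-name>")
```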

    Conclusion

    This post describes the development of a generative AI Slack application powered by Amazon Bedrock and Amazon Kendra. This is designed to be an internal-facing Slack chat assistant that helps answer questions related to the indexed content. The solution architecture includes Amazon Lex for intent identification, a Lambda function for fulfilling the fallback intent, Amazon Kendra for FAQ searches and indexing crawled web pages, and Amazon Bedrock for generating responses. The post walks through the deployment of the solution using a CloudFormation template, provides instructions for running sample queries, and discusses the steps for cleaning up the resources. Overall, this post demonstrates how to use various AWS services to build a powerful generative AI–powered chat assistant application.

    This solution demonstrates the power of generative AI in building intelligent chat assistants and search assistants. Explore the generative AI Slack chat assistant: Invite your teams to a Slack workspace and start getting answers to your indexed content and FAQs. Experiment with different use cases and see how you can harness the capabilities of services like Amazon Bedrock and Amazon Kendra to enhance your business operations. For more information about using Amazon Bedrock with Slack, refer to Deploy a Slack gateway for Amazon Bedrock.


    About the authors

    Kruthi Jayasimha Rao is a Partner Solutions Architect with a focus on AI and ML. She provides technical guidance to AWS Partners in following best practices to build secure, resilient, and highly available solutions in the AWS Cloud.

    Mohamed Mohamud is a Partner Solutions Architect with a focus on Data Analytics. He specializes in streaming analytics, helping partners build real-time data pipelines and analytics solutions on AWS. With expertise in services like Amazon Kinesis, Amazon MSK, and Amazon EMR, Mohamed enables data-driven decision-making through streaming analytics.
