How Dialog Axiata used Amazon SageMaker to scale ML models in production with AI Factory and reduced customer churn within 3 months

The telecommunications industry is more competitive than ever before. With customers able to easily switch between providers, reducing customer churn is a crucial priority for telecom companies who want to stay ahead. To address this challenge, Dialog Axiata has pioneered a cutting-edge solution called the Home Broadband (HBB) Churn Prediction Model.

This post explores the intricacies of Dialog Axiata’s approach, from the creation of nearly 100 features across 10 distinct areas to the implementation of two essential models using Amazon SageMaker:

  • A base model powered by CatBoost, an open source implementation of the Gradient Boosting Decision Tree (GBDT) algorithm
  • An ensemble model, taking advantage of the strengths of multiple machine learning (ML) models

About Dialog Axiata

Dialog Axiata PLC (part of the Axiata Group Berhad) is one of Sri Lanka’s largest quad-play telecommunications service providers and the country’s largest mobile network operator with 17.1 million subscribers, which amounts to 57% of the Sri Lankan mobile market. Dialog Axiata provides a variety of services, such as fixed-line, home broadband, mobile, television, payment apps, and financial services in Sri Lanka.

In 2022, Dialog Axiata made significant progress in their digital transformation efforts, with AWS playing a key role in this journey. They focused on improving customer service using data with artificial intelligence (AI) and ML and saw positive results, with their Group AI Maturity increasing from 50% to 80%, according to the TM Forum’s AI Maturity Index.

Dialog Axiata runs some of their business-critical telecom workloads on AWS, including Charging Gateway, Payment Gateway, Campaign Management System, SuperApp, and various analytics tasks. They use a variety of AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic Kubernetes Service (Amazon EKS) for computing, Amazon Relational Database Service (Amazon RDS) for databases, Amazon Simple Storage Service (Amazon S3) for object storage, Amazon OpenSearch Service for search and analytics, SageMaker for ML, and AWS Glue for data integration. This strategic use of AWS services delivers efficiency and scalability for their operations and enables the implementation of advanced AI/ML applications.

For more about how Axiata uses AWS services, see Axiata Selects AWS as its Primary Cloud Provider to Drive Innovation in the Telecom Industry.

Challenges with understanding customer churn

The Sri Lankan telecom market has high churn rates due to several factors. Multiple mobile operators provide similar services, making it easy for customers to switch between providers. Prepaid services dominate the market, and multi-SIM usage is widespread. These conditions lead to a lack of customer loyalty and high churn rates.

In addition to its core business of mobile telephony, Dialog Axiata also offers a number of services, including broadband connections and Dialog TV. However, customer churn is a common issue in the telecom industry. Therefore, Dialog Axiata needs to find ways to reduce their churn rate and retain more of their existing home broadband customers. Potential solutions could involve improving customer satisfaction, enhancing value propositions, analyzing reasons for churn, or implementing customer retention initiatives. The key is for Dialog Axiata to gain insights into why customers are leaving and take meaningful actions to increase customer loyalty and satisfaction.

Solution overview

To reduce customer churn, Dialog Axiata used SageMaker to build a predictive model that assigns each customer a churn risk score. The model was trained on demographic, network usage, and network outage data from across the organization. By predicting churn 45 days in advance, Dialog Axiata is able to proactively retain customers and significantly reduce customer churn.

Dialog Axiata’s churn prediction approach is built on a robust architecture involving two distinct pipelines: one dedicated to training the models, and the other for inference or making predictions. The training pipeline is responsible for developing the base model, which is a CatBoost model trained on a comprehensive set of features. To further enhance the predictive capabilities, an ensemble model is also trained to identify potential churn instances that may have been missed by the base model. This ensemble model is designed to capture additional insights and patterns that the base model alone may not have effectively captured.

The integration of the ensemble model alongside the base model creates a synergistic effect, resulting in a more comprehensive and accurate inference process. By combining the strengths of both models, Dialog Axiata’s churn prediction system gains an enhanced overall predictive capability, providing a more robust and reliable identification of customers at risk of churning.

Both the training and inference pipelines are run three times per month, aligning with Dialog Axiata’s billing cycle. This regular schedule makes sure that the models are trained and updated with the latest customer data, enabling timely and accurate churn predictions.

In the training process, features are sourced from Amazon SageMaker Feature Store, which houses nearly 100 carefully curated features. Because real-time inference is not a requirement for this specific use case, an offline feature store is used to store and retrieve the necessary features efficiently. This approach allows for batch inference, significantly reducing daily expenses to under $0.50 while processing batch sizes averaging around 100,000 customers within a reasonable runtime of approximately 50 minutes.
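The offline feature store can be queried through Amazon Athena with the SageMaker Python SDK. The following is a minimal sketch of pulling batch-inference features this way; the feature group name and S3 output location are placeholders for illustration, not Dialog Axiata’s actual configuration.

import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()

# Hypothetical feature group name standing in for the curated churn features
feature_group = FeatureGroup(name="hbb-churn-features", sagemaker_session=session)

# Build an Athena query against the offline store backing the feature group
query = feature_group.athena_query()
table = query.table_name

query.run(
    query_string=f'SELECT * FROM "{table}"',                 # pull the curated features
    output_location="s3://example-bucket/athena-results/",   # assumed S3 path for query output
)
query.wait()

# Features for the ~100,000-customer batch come back as a pandas DataFrame
features_df = query.as_dataframe()
print(features_df.shape)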

Dialog Axiata has meticulously selected instance types to strike a balance between optimal resource utilization and cost-effectiveness. However, should the need arise for faster pipeline runtime, larger instance types can be recommended. This flexibility allows Dialog Axiata to adjust the pipeline’s performance based on specific requirements, while considering the trade-off between speed and cost considerations.

After the predictions are generated separately using both the base model and the ensemble model, Dialog Axiata takes action to retain the customers identified as potential churn risks. The customers predicted to churn by the base model, along with those exclusively identified by the ensemble model, are targeted with personalized retention campaigns. By excluding any overlapping customers between the two models, Dialog Axiata ensures a focused and efficient outreach strategy.

The following figure illustrates the output predictions and churn probabilities generated by the base model and the ensemble model.

The first table is the output from the base model, which provides valuable insights into each customer’s churn risk. The columns in this table include a customer identifier (Cx), a Churn Reason column that highlights potential reasons for churn, such as Daily Usage or ARPU Drop (Average Revenue Per User), and a Churn Probability column that quantifies the likelihood of each customer churning.

The second table presents the output from the ensemble model, a complementary approach designed to capture additional churn risks that may have been missed by the base model. This table has two columns: the customer identifier (Cx) and a binary Churn column that indicates whether the customer is predicted to churn (1) or not (0).

The arrows connecting the two tables visually represent the process Dialog Axiata employs to comprehensively identify customers at risk of churning.

The following figure showcases the comprehensive output of this analysis, where customers are meticulously segmented, scored, and classified according to their propensity to churn or discontinue their services. The analysis delves into various factors, such as customer profiles, usage patterns, and behavioral data, to accurately identify those at a higher risk of churning. With this predictive model, Dialog Axiata can pinpoint specific customer segments that require immediate attention and tailored retention efforts.

With this powerful information, Dialog Axiata develops targeted retention strategies and campaigns specifically designed for high-risk customer groups. These campaigns may include personalized offers, as shown in the following figure, incentives, or customized communication aimed at addressing the unique needs and concerns of at-risk customers.

These personalized campaigns, tailored to each customer’s needs and preferences, aim to proactively address their concerns and provide compelling reasons for them to continue their relationship with Dialog Axiata.

Methodologies

This solution uses the following methodologies:

  • Comprehensive analysis of customer data – The foundation of the solution’s success lies in the comprehensive analysis of more than 100 features spanning demographic, usage, payment, network, package, geographic (location), quad-play, customer experience (CX) status, complaint, and other related data. This meticulous approach allows Dialog Axiata to gain valuable insights into customer behavior, enabling them to predict potential churn events with remarkable accuracy.
  • Dual-model strategy (base and ensemble models) – What sets Dialog Axiata’s approach apart is the use of two essential models. The base model, powered by CatBoost, provides a solid foundation for churn prediction; the threshold probability that defines churn is chosen by considering ROC optimization and business requirements. Concurrently, the ensemble model strategically combines the strengths of various algorithms, enhancing the robustness and accuracy of the predictions. The models are developed with precision as the evaluation parameter (a minimal sketch of the base-model flow follows this list).
  • Actionable insights shared with business units – The insights derived from the models are not confined to the technical realm. Dialog Axiata ensures that these insights are effectively communicated and put into action by sharing the models separately with the business units. This collaborative approach means that the organization is better equipped to proactively address customer churn.
  • Proactive measures with two action types – Equipped with insights from the models, Dialog Axiata has implemented two main action types: network issue-based and non-network issue-based. During the inference phase, the churn status and churn reason are predicted. The top five features that have a high probability for the churn reason are selected using SHAP (SHapley Additive exPlanations). Then, the selected features associated with the churn reason are further classified into two categories: network issue-based and non-network issue-based. If there are features related to network issues, those users are categorized as network issue-based users. The resultant categorization, along with the predicted churn status for each user, is then transmitted for campaign purposes. This information is valuable in scheduling targeted campaigns based on the identified churn reasons, enhancing the precision and effectiveness of the overall campaign strategy.
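As referenced above, the following is a minimal sketch of the base-model pattern: train a CatBoost classifier, pick a churn threshold from the ROC curve, and surface the top five SHAP features as the churn reason. The synthetic features, the Youden’s J threshold rule, and the network-issue feature list are illustrative assumptions, not Dialog Axiata’s actual configuration.

import numpy as np
import pandas as pd
import shap
from catboost import CatBoostClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature set (the actual ~100 features are internal to Dialog Axiata)
rng = np.random.default_rng(42)
X = pd.DataFrame(
    rng.normal(size=(2000, 6)),
    columns=["daily_usage", "arpu_drop", "outage_count", "signal_drop_rate", "tenure", "payments"],
)
y = (X["arpu_drop"] + X["outage_count"] + rng.normal(scale=0.5, size=2000) > 1).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = CatBoostClassifier(iterations=300, depth=6, learning_rate=0.05, verbose=False)
model.fit(X_train, y_train, eval_set=(X_val, y_val))

# Churn threshold from the ROC curve (Youden's J as one example; the real threshold also reflects business rules)
fpr, tpr, thresholds = roc_curve(y_val, model.predict_proba(X_val)[:, 1])
threshold = thresholds[np.argmax(tpr - fpr)]

# Score customers and pull the top 5 SHAP features as the churn reason
churn_prob = model.predict_proba(X_val)[:, 1]
shap_values = shap.TreeExplainer(model).shap_values(X_val)
NETWORK_FEATURES = {"outage_count", "signal_drop_rate"}  # hypothetical network-related features

for i in np.where(churn_prob >= threshold)[0][:5]:
    top5 = X_val.columns[np.argsort(-shap_values[i])[:5]].tolist()
    reason = "network issue-based" if set(top5) & NETWORK_FEATURES else "non-network issue-based"
    print(X_val.index[i], round(float(churn_prob[i]), 3), reason, top5)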

Dialog Axiata’s AI Factory

Dialog Axiata built the AI Factory to facilitate running all AI/ML workloads on a single platform with multiple capabilities across various building blocks. To tackle technical aspects and challenges related to continuous integration and continuous delivery (CI/CD) and cost-efficiency, Dialog Axiata turned to the AI Factory framework. Using the power of SageMaker as the platform, they implemented separate SageMaker pipelines for model training and inference, as shown in the following diagram.

A primary advantage lies in cost reduction through the implementation of CI/CD pipelines: conducting experiments within these automated pipelines yields significant cost savings and maintains an experiment version tracking system. Additionally, the integration of AI Factory components reduces time to production and overall workload by cutting repetitive tasks through reusable artifacts. The incorporation of an experiment tracking system facilitates the monitoring of performance metrics, enabling a data-driven approach to decision-making.

Furthermore, the deployment of alerting systems enhances the proactive identification of failures, allowing for immediate actions to resolve issues. Data drift and model drift are also monitored. This streamlined process makes sure that any issues are addressed promptly, minimizing downtime and optimizing system reliability. By developing this project under the AI Factory framework, Dialog Axiata could overcome the aforementioned challenges.

The AI Factory framework also provides robust security controls to govern confidential user data and access permissions. It offers solutions to optimize AWS costs, including lifecycle configurations, alerting systems, and monitoring dashboards. These measures contribute to enhanced data security and cost-effectiveness, aligning with Dialog Axiata’s objectives and resulting in the efficient operation of AI initiatives.

Dialog Axiata’s MLOps process

The following diagram illustrates Dialog Axiata’s MLOps process.

The following key components are used in the process:

  • SageMaker as the ML Platform – Dialog Axiata uses SageMaker as their core ML platform to perform feature engineering, and train and deploy models in production.
  • SageMaker Feature Store – By using a centralized repository for ML features, SageMaker Feature Store enhances data consumption and facilitates experimentation with validation data. Instead of directly ingesting data from the data warehouse, the required features for training and inference steps are taken from the feature store. With SageMaker Feature Store, Dialog Axiata could reduce the time for feature creation because they could reuse the same features.
  • Amazon SageMaker Pipelines – Amazon SageMaker Pipelines is a CI/CD service for ML. These workflow automation components helped the Dialog Axiata team effortlessly scale their ability to build, train, test, and deploy multiple models in production; iterate faster; reduce errors due to manual orchestration; and build repeatable mechanisms (a minimal pipeline sketch follows this list).
  • Reusable components – Employing containerized environments, such as Docker images, and custom modules promoted the bring your own code approach within Dialog Axiata’s ML pipelines.
  • Monitoring and alerting – Monitoring tools and alert systems provided ongoing success by keeping track of the model and pipeline status.
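As referenced in the SageMaker Pipelines item above, the following is a minimal sketch of how such a pipeline is wired together with a single processing step. The role ARN, script name, and instance choices are placeholders, not Dialog Axiata’s production configuration; real pipelines add training, evaluation, and model registration steps.

import sagemaker
from sagemaker.processing import ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder execution role

# A single feature-engineering step running an assumed script
processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    sagemaker_session=session,
)

feature_step = ProcessingStep(
    name="FeatureEngineering",
    processor=processor,
    code="feature_engineering.py",  # hypothetical script
    outputs=[ProcessingOutput(output_name="features", source="/opt/ml/processing/output")],
)

pipeline = Pipeline(name="churn-training-pipeline", steps=[feature_step], sagemaker_session=session)
pipeline.upsert(role_arn=role)   # create or update the pipeline definition
execution = pipeline.start()     # run it (scheduled three times per month in Dialog Axiata's case)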

Business outcomes

The churn prediction solution implemented by Dialog Axiata has yielded remarkable business outcomes, exemplifying the power of data-driven decision-making and strategic deployment of AI/ML technologies. Within a relatively short span of 5 months, the company witnessed a substantial reduction in month-over-month gross churn rates, a testament to the effectiveness of the predictive model and the actionable insights it provides.

This outstanding achievement not only underscores the robustness of the solution, it also highlights its pivotal role in fortifying Dialog Axiata’s position as a leading player in Sri Lanka’s highly competitive telecommunications landscape. By proactively identifying and addressing potential customer churn risks, the company has reinforced its commitment to delivering exceptional service and fostering long-lasting customer relationships.

Conclusion

Dialog Axiata’s journey in overcoming telecom churn challenges showcases the power of innovative solutions and the seamless integration of AI technologies. By using the AI Factory framework and SageMaker, Dialog Axiata not only addressed complex technical challenges, but also achieved tangible business benefits. This success story emphasizes the crucial role of predictive analytics in staying ahead in the competitive telecom industry, demonstrating the transformative impact of advanced AI models.

Thank you for reading this post; we hope you learned something new and useful. Please don’t hesitate to leave your feedback in the comments section.

Thank you Nilanka S. Weeraman, Sajani Jayathilaka, and Devinda Liyanage for your valuable contributions to this blog post.


About the Authors

Senthilvel (Vel) Palraj is a Senior Solutions Architect at AWS with over 15 years of IT experience. In this role, he helps customers in the telecom and media and entertainment industries across India and SAARC countries transition to the cloud. Before joining AWS India, Vel worked as a Senior DevOps Architect with AWS ProServe North America, supporting major Fortune 500 corporations in the United States. He is passionate about generative AI and AI/ML and uses his deep knowledge to provide strategic guidance to companies looking to adopt and optimize AWS services. Outside of work, Vel enjoys spending time with his family and mountain biking on rough terrain.

Chamika Ramanayake is the Head of AI Platforms at Dialog Axiata PLC, Sri Lanka’s leading telecommunications company. He draws on 7 years of experience in the telecommunications industry in leading his team to design and set the foundation for operationalizing the end-to-end AI/ML system lifecycle in the AWS Cloud environment. He holds an MBA from PIM, University of Sri Jayawardenepura, and a B.Sc. Eng (Hons) in Electronics and Telecommunication Engineering from the University of Moratuwa.

Amazon SageMaker now integrates with Amazon DataZone to streamline machine learning governance

Amazon SageMaker is a fully managed machine learning (ML) service that provides a range of tools and features for building, training, and deploying ML models. Amazon DataZone is a data management service that makes it faster and easier for customers to catalog, discover, share, and govern data stored across AWS, on-premises, and third-party sources.

Today, we are excited to announce an integration between Amazon SageMaker and Amazon DataZone to help you set up infrastructure with security controls, collaborate on machine learning (ML) projects, and govern access to data and ML assets.

When solving a business problem with ML, you create ML models from training data and integrate those models with business applications to make predictive decisions. For example, you could use an ML model for loan application processing to make decisions such as approving or denying a loan. When deploying such ML models, effective ML governance helps build trust in ML-powered applications, minimize risks, and promote responsible AI practices.

A comprehensive governance strategy spans across infrastructure, data, and ML. ML governance requires implementing policies, procedures, and tools to identify and mitigate various risks associated with ML use cases. Applying governance practices at every stage of the ML lifecycle is essential for successfully maximizing the value for the organization. For example, when building an ML model for a loan application processing use case, you can align the model development and deployment with your organization’s overall governance policies and controls to create effective loan approval workflows.

However, it might be challenging and time-consuming to apply governance across an ML lifecycle because it typically requires custom workflows and integration of several tools. With the new built-in integration between SageMaker and Amazon DataZone, you can streamline setting up ML governance across infrastructure, collaborate on business initiatives, and govern data and ML assets in just a few clicks.

For governing ML use cases, this new integration offers the following capabilities:

  • Business project management – You can create, edit, and view projects, as well as add users to start collaborating on the shared business objective
  • Infrastructure management – You can create multiple project environments and deploy infrastructure resources with embedded security controls to meet the enterprise needs
  • Asset governance – Users can search, discover, request access, and publish data and ML assets along with business metadata to the enterprise business catalog

In this post, we dive deep into how to set up and govern ML use cases. We discuss the end-to-end journey for setup and configuration of the SageMaker and Amazon DataZone integration. We also discuss how you can use self-service capabilities to discover, subscribe, consume, and publish data and ML assets as you work through your ML lifecycle.

Solution overview

With Amazon DataZone, administrators and data stewards who oversee an organization’s data assets can manage and govern access to data. These controls are designed to enforce access with the right level of privileges and context. Amazon DataZone makes it effortless for engineers, data scientists, product managers, analysts, and business users to access data throughout an organization so that they can discover, use, and collaborate to derive data-driven insights. The following diagram illustrates a sample architecture of Amazon DataZone and Amazon SageMaker integration.

With this integration, you can deploy SageMaker infrastructure using blueprints. The new SageMaker blueprint provides a well-architected infrastructure template. With this template, ML administrators can build a SageMaker environment profile with appropriate controls from services such as Amazon Virtual Private Cloud (Amazon VPC), AWS Key Management Service (AWS KMS), and AWS Identity and Access Management (IAM), and enable ML builders to use this environment profile to deploy a SageMaker domain in minutes. When you create a SageMaker environment using the SageMaker environment profile, Amazon DataZone provisions a data and ML asset catalog, Amazon SageMaker Studio, and IAM roles for managing Amazon DataZone project permissions. The following diagram shows how the SageMaker environment fits in with the existing environments in Amazon DataZone projects.

To facilitate data and ML asset governance from SageMaker Studio, we extended SageMaker Studio to incorporate the following component:

  • Asset – A data or ML resource that can be published to a catalog or project inventory, discovered, and shared. Amazon Redshift tables and AWS Glue tables are original Amazon DataZone assets. With this integration, we introduce two more asset types: SageMaker Feature Groups and Model Package Groups.
  • Owned assets – A collection of project inventory assets discoverable only by project members. These are the staging assets in the project inventory that are not available to Amazon DataZone domain users until they are explicitly published to the Amazon DataZone business catalog.
  • Asset catalog – A collection of published assets in the Amazon DataZone business catalog discoverable across your organization with business context, thereby enabling everyone in your organization to find assets quickly for their use case.
  • Subscribed assets – A collection of assets from the Amazon DataZone business catalog that the subscriber has been approved to access. Owners of those assets must approve the access request before the subscriber can consume them.

The following diagram shows an example of the lifecycle of an ML asset such as Customer-Churn-Model with the described components.

In the following sections, we show you the user experience of the SageMaker and Amazon DataZone integration with an example. We demonstrate how to set up Amazon DataZone, including a domain, project, and SageMaker environment, and how to perform asset management using SageMaker Studio. The following diagram illustrates our workflow.

Set up an Amazon DataZone domain, project, and SageMaker environment

On the Amazon DataZone console, administrators create an Amazon DataZone domain, get access to the Amazon DataZone data portal, and provision a new project with access to specific data and users.

Administrators use the SageMaker blueprint, which has enterprise-level security controls, to set up the SageMaker environment profile. The SageMaker infrastructure, with appropriate organizational boundaries, then deploys in minutes so that ML builders can start using it for their ML use cases.

In the Amazon DataZone data portal, ML builders can create or join a project to collaborate on the business problem being solved. To start their ML use case in SageMaker, they use the SageMaker environment profile made by the administrators to create a SageMaker environment or use an existing one.

ML builders can then seamlessly federate into SageMaker Studio from the Amazon DataZone data portal with just a few clicks. The following actions can happen in SageMaker Studio:

  • Subscribe – SageMaker allows you to find, access, and consume the assets in the Amazon DataZone business catalog. When you find an asset in the catalog that you want to access, you need to subscribe to the asset, which creates a subscription request to the asset owner.
  • Publish – SageMaker allows you to publish your assets and their metadata, as the owner of the asset, to the Amazon DataZone business catalog so that others in the organization can subscribe to and consume them in their ML use cases.

Perform asset management using SageMaker Studio

In SageMaker Studio, ML builders can search, discover, and subscribe to data and ML assets in their business catalog. They can consume these assets for ML workflows such as data preparation, model training, and feature engineering in SageMaker Studio and SageMaker Canvas. Upon completing the ML tasks, ML builders can publish data, models, and feature groups to the business catalog for governance and discoverability.

Search and discover assets

After ML builders are federated into SageMaker Studio, they can view the Assets option in the navigation pane.

On the Assets page, ML builders can search and discover data assets and ML assets without additional administrator overhead.

The search result displays all the assets corresponding to the search criteria, including a name and description. ML builders can further filter by the type of asset to narrow down their results. The following screenshot is an example of available assets from a search result.

Subscribe to assets

After ML builders discover the asset from their search results, they can choose the asset to see details such as schema or metadata to understand whether the asset is useful for their use case.

To gain access to the asset, choose Subscribe to initiate the request for access from the asset owner. This action allows data governance for the asset owners to determine which members of the organization can access their assets.

The owner of the asset will be able to see the request in the Incoming subscription requests section on the Assets page. The asset owners can approve or reject the request with justifications. ML builders will also be able to see the corresponding action on the Assets page in the Outgoing subscription requests section. The following screenshot shows an example of managing asset requests and the Subscribed assets tab. In the next steps, we demonstrate how a subscribed data asset like mkt_sls_table and an ML asset like Customer-Churn-Model are used within SageMaker.

Consume subscribed assets

After ML builders are approved to access the subscribed assets, they can choose to use Amazon SageMaker Canvas or JupyterLab within SageMaker Studio. In this section, we explore the scenarios in which ML builders can consume the subscribed assets.

Consume a subscribed Model Package Group in SageMaker Studio

ML builders can see all the subscribed Model Package Groups in SageMaker Studio by choosing Open in Model Registry on the asset details page. ML builders are also able to consume the subscribed model by deploying the model to an endpoint for prediction. The following screenshot shows an example of opening a subscribed model asset.

Consume a subscribed data asset in SageMaker Canvas

When ML builders open the SageMaker Canvas app from SageMaker Studio, they are able to use Amazon SageMaker Data Wrangler and datasets. ML builders can view their subscribed data asset to perform experimentation and build models. As part of this integration, ML builders can view their subscribed assets under sub_db and publish their assets via pub_db. The created models can then be registered in the Amazon SageMaker Model Registry from SageMaker Canvas. The following screenshot is an example of the subscribed asset mkt_sls_table for data preparation in SageMaker Canvas.

Consume a subscribed data asset in JupyterLab notebooks

ML builders can navigate to JupyterLab in SageMaker Studio to open a notebook and start their data experimentation. In JupyterLab notebooks, ML builders are able to see the subscribed data assets to query in their notebook and consume for experimentation and model building. The following screenshot is an example of the subscribed asset mkt_sls_table for data preparation in SageMaker Studio.

Publish assets

After experimentation and analysis, ML builders are able to share assets with the rest of the organization by publishing them to the Amazon DataZone business catalog. They can also make their assets available only to project members by publishing them to the project inventory. ML builders can achieve these tasks by using the SageMaker SDK or publishing directly from SageMaker Studio.

You can publish ML assets by navigating to the specific asset tab and choosing Publish to asset catalog or Publish to inventory. The following screenshot shows how you can publish a feature group to the asset catalog.

The following screenshot shows how you can publish a model group to the asset catalog or project inventory.

On the Assets page, you can use the data source feature to publish data assets like an AWS Glue table or Redshift table.

Conclusion

Governance is a multi-faceted discipline that encompasses controls across infrastructure management, data management, model management, access management, policy management, and more. ML governance plays a key role for organizations to successfully scale their ML usage across a wide range of use cases and also mitigate technical and operational risks.

The new SageMaker and Amazon DataZone integration enables your organization to streamline infrastructure controls and permissions, in addition to data and ML asset governance in ML projects. The provisioned ML environment is secure, scalable, and reliable for your teams to access data and ML assets, and build and train ML models.

We would like to hear from you on how this new capability is helping your ML governance use cases. Be on the lookout for more data and ML governance blog posts. Try out this new SageMaker integration for ML governance and leave your comments in the comments section.


About the authors

Siamak Nariman is a Senior Product Manager at AWS. He is focused on AI/ML technology, digital transformation, and enabling automation to improve overall organizational efficiency and productivity. He has over 7 years of automation experience deploying various technologies. In his spare time, Siamak enjoys exploring the outdoors, long-distance running, and playing sports.

Kareem Syed-Mohammed is a Product Manager at AWS. He is focused on ML observability and ML governance. Prior to this, at Amazon QuickSight, he led embedded analytics and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, worked on Local Expert and Ads at Expedia, and was a management consultant at McKinsey.

Dr. Sokratis Kartakis is a Principal Machine Learning and Operations Specialist Solutions Architect at AWS. Sokratis focuses on enabling enterprise customers to industrialize their machine learning (ML) and generative AI solutions by exploiting AWS services and shaping their operating model, that is, MLOps/FMOps/LLMOps foundations and a transformation roadmap based on development best practices. He has spent over 15 years inventing, designing, leading, and implementing innovative end-to-end production-level ML and AI solutions in the domains of energy, retail, health, finance, motorsports, and more.

Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his 3-year-old Sheepadoodle.

Boost employee productivity with automated meeting summaries using Amazon Transcribe, Amazon SageMaker, and LLMs from Hugging Face

The prevalence of virtual business meetings in the corporate world, largely accelerated by the COVID-19 pandemic, is here to stay. Based on a survey conducted by American Express in 2023, 41% of business meetings are expected to take place in a hybrid or virtual format by 2024. Attending multiple meetings daily and keeping track of all ongoing topics gets increasingly difficult to manage over time. This can have a negative impact in many ways, from delayed project timelines to loss of customer trust. Writing meeting summaries is the usual remedy to overcome this challenge, but it disturbs the focus required to listen to ongoing conversations.

A more efficient way to manage meeting summaries is to create them automatically at the end of a call through the use of generative artificial intelligence (AI) and speech-to-text technologies. This allows attendees to focus solely on the conversation, knowing that a transcript will be made available automatically at the end of the call.

This post presents a solution to automatically generate a meeting summary from a recorded virtual meeting (for example, using Amazon Chime) with several participants. The recording is transcribed to text using Amazon Transcribe and then processed using Amazon SageMaker Hugging Face containers to generate the meeting summary. The Hugging Face containers host a large language model (LLM) from the Hugging Face Hub.

If you prefer to generate post-call recording summaries with Amazon Bedrock rather than Amazon SageMaker, check out this Bedrock sample solution. For a generative AI-powered Live Meeting Assistant that creates post-call summaries and also provides live transcripts, translations, and contextual assistance based on your own company knowledge base, see our new LMA solution.

Solution overview

The entire infrastructure of the solution is provisioned using the AWS Cloud Development Kit (AWS CDK), which is an infrastructure as code (IaC) framework to programmatically define and deploy AWS resources. The framework provisions resources in a safe, repeatable manner, allowing for a significant acceleration of the development process.

Amazon Transcribe is a fully managed service that seamlessly runs automatic speech recognition (ASR) workloads in the cloud. The service allows for simple audio data ingestion, easy-to-read transcript creation, and accuracy improvement through custom vocabularies. Amazon Transcribe’s new ASR foundation model supports 100+ language variants. In this post, we use the speaker diarization feature, which enables Amazon Transcribe to differentiate between a maximum of 10 unique speakers and label a conversation accordingly.

Hugging Face is an open-source machine learning (ML) platform that provides tools and resources for the development of AI projects. Its key offering is the Hugging Face Hub, which hosts a vast collection of over 200,000 pre-trained models and 30,000 datasets. The AWS partnership with Hugging Face allows a seamless integration through SageMaker with a set of Deep Learning Containers (DLCs) for training and inference, and Hugging Face estimators and predictors for the SageMaker Python SDK.

Generative AI CDK Constructs, an open-source extension of AWS CDK, provides well-architected multi-service patterns to quickly and efficiently create the repeatable infrastructure required for generative AI projects on AWS. For this post, we illustrate how it simplifies the deployment of foundation models (FMs) from Hugging Face or Amazon SageMaker JumpStart with SageMaker real-time inference, which provides persistent and fully managed endpoints to host ML models. These endpoints are designed for real-time, interactive, low-latency workloads and provide auto scaling to manage load fluctuations. For the languages supported by Amazon Transcribe, you can find Hugging Face FMs that support summarization in the corresponding languages.

The following diagram depicts the automated meeting summarization workflow.


The workflow consists of the following steps:

  1. The user uploads the meeting recording as an audio or video file to the project’s Amazon Simple Storage Service (Amazon S3) bucket, in the /recordings folder.
  2. Every time a new recording is uploaded to this folder, an AWS Lambda Transcribe function is invoked and initiates an Amazon Transcribe job that converts the meeting recording into text (a minimal sketch of such a function follows this list). Transcripts are then stored in the project’s S3 bucket under /transcriptions/TranscribeOutput/.
  3. This triggers the Inference Lambda function, which preprocesses the transcript file into an adequate format for ML inference, stores it in the project’s S3 bucket under the prefix /summaries/InvokeInput/processed-TranscribeOutput/, and invokes a SageMaker endpoint. The endpoint hosts the Hugging Face model that summarizes the processed transcript. The summary is loaded into the S3 bucket under the prefix /summaries. Note that the prompt template used in this example includes a single instruction; for more sophisticated requirements, the template can be easily extended to tailor the solution to your own use case.
  4. This S3 event triggers the Notification Lambda function, which pushes the summary to an Amazon Simple Notification Service (Amazon SNS) topic.
  5. All subscribers of the SNS topic (such as meeting attendees) receive the summary in their email inbox.
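As referenced in step 2, the following is a minimal sketch of a Lambda handler that starts the transcription job when a recording lands in the /recordings prefix. The output prefix, language setting, and media-format handling are assumptions for illustration; the actual project code is in the GitHub repository.

import json
import os
import urllib.parse
import uuid

import boto3

transcribe = boto3.client("transcribe")


def handler(event, context):
    # S3 put event for a new file under the /recordings prefix
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    transcribe.start_transcription_job(
        TranscriptionJobName=f"meeting-{uuid.uuid4()}",
        Media={"MediaFileUri": f"s3://{bucket}/{key}"},
        MediaFormat=os.path.splitext(key)[1].lstrip(".").lower(),  # mp4, mp3, or wav
        LanguageCode="en-US",                                      # assumed language
        OutputBucketName=bucket,
        OutputKey="transcriptions/TranscribeOutput/",
        Settings={"ShowSpeakerLabels": True, "MaxSpeakerLabels": 10},
    )
    return {"statusCode": 200, "body": json.dumps("Transcription job started")}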

In this post, we deploy Mistral 7B Instruct, an LLM available in the Hugging Face Model Hub, to a SageMaker endpoint to perform the summarization tasks. Mistral 7B Instruct, developed by Mistral AI, has over 7 billion parameters, enabling it to process and generate text based on user instructions. It has been trained on a wide-ranging corpus of text data to understand various contexts and nuances of language. The model is designed to perform tasks such as answering questions, summarizing information, and creating content by following specific prompts given by users. Its effectiveness is measured through metrics like perplexity, accuracy, and F1 score, and it is fine-tuned to respond to instructions with relevant and coherent text outputs.

Prerequisites

To follow along with this post, you should have the following prerequisites:

Deploy the solution

To deploy the solution in your own AWS account, refer to the GitHub repository to access the full source code of the AWS CDK project in Python:

git clone https://github.com/aws-samples/audio-conversation-summary-with-hugging-face-and-transcribe.git
cd audio-conversation-summary-with-hugging-face-and-transcribe/infrastructure
pip install -r requirements.txt

If you are deploying AWS CDK assets for the first time in your AWS account and the AWS Region you specified, you need to run the bootstrap command first. It sets up the baseline AWS resources and permissions required for AWS CDK to deploy AWS CloudFormation stacks in a given environment:

cdk bootstrap aws://<ACCOUNT_ID>/<AWS_REGION>

Finally, run the following command to deploy the solution. Specify the summary’s recipient mail address in the SubscriberEmailAddress parameter:

cdk deploy --parameters SubscriberEmailAddress="<SUBSCRIBER_MAIL_ADDRESS>"

Test the solution

We have provided a few sample meeting recordings in the data folder of the project repository. You can upload the test.mp4 recording into the project’s S3 bucket under the /recordings folder. The summary will be saved in Amazon S3 and sent to the subscriber. The end-to-end duration is approximately 2 minutes given an input of approximately 250 tokens.

The following figure shows the input conversation and output summary.

Limitations

This solution has the following limitations:

  • The model provides high-accuracy completions for the English language. You can use other languages such as Spanish, French, or Portuguese, but the quality of the completions may degrade. You can find other Hugging Face models that are better suited for those languages.
  • The model used in this post is limited by a context length of approximately 8,000 tokens, which equates to approximately 6,000 words. If a larger context length is required, you can replace the model by referencing the new model ID in the respective AWS CDK construct.
  • Like other LLMs, Mistral 7B Instruct may hallucinate, generating content that strays from factual reality or includes fabricated information.
  • The format of the recordings must be either .mp4, .mp3, or .wav.

Clean up

To delete the deployed resources and stop incurring charges, run the following command:

cdk destroy

Alternatively, to use the AWS Management Console, complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Select the stack called Text-summarization-Infrastructure-stack and choose Delete.

Conclusion

In this post, we proposed an architecture pattern to automatically transform your meeting recordings into insightful conversation summaries. This workflow showcases how the AWS Cloud and Hugging Face can help you accelerate your generative AI application development by orchestrating a combination of managed AI services such as Amazon Transcribe and externally sourced ML models from the Hugging Face Hub, such as those from Mistral AI.

If you are eager to learn more about how conversation summaries can apply to a contact center environment, you can deploy this technique in our suite of solutions for Live Call Analytics and Post Call Analytics.

References

Mistral 7B release post, by Mistral AI

Our team

This post has been created by AWS Professional Services, a global team of experts that can help realize desired business outcomes when using the AWS Cloud. We work together with your team and your chosen member of the AWS Partner Network (APN) to implement your enterprise cloud computing initiatives. Our team provides assistance through a collection of offerings that help you achieve specific outcomes related to enterprise cloud adoption. We also deliver focused guidance through our global specialty practices, which cover a variety of solutions, technologies, and industries.


About the Authors

Gabriel Rodriguez Garcia is a Machine Learning engineer at AWS Professional Services in Zurich. In his current role, he has helped customers achieve their business goals on a variety of ML use cases, ranging from setting up MLOps inference pipelines to developing a fraud detection application. Whenever he is not working, he enjoys doing physical activities, listening to podcasts, or reading books.

Jahed Zaïdi is an AI & Machine Learning specialist at AWS Professional Services in Paris. He is a builder and trusted advisor to companies across industries, helping businesses innovate faster and on a larger scale with technologies ranging from generative AI to scalable ML platforms. Outside of work, you will find Jahed discovering new cities and cultures, and enjoying outdoor activities.

Mateusz Zaremba is a DevOps Architect at AWS Professional Services. Mateusz supports customers at the intersection of machine learning and DevOps specialization, helping them to bring value efficiently and securely. Beyond tech, he is an aerospace engineer and avid sailor.

Kemeng Zhang is currently working at AWS Professional Services in Zurich, Switzerland, with a specialization in AI/ML. She has been part of multiple NLP projects, from behavioral change in digital communication to fraud detection. Apart from that, she is interested in UX design and playing cards.

How Veritone uses Amazon Bedrock, Amazon Rekognition, Amazon Transcribe, and information retrieval to update their video search pipeline

This post is co-written with Tim Camara, Senior Product Manager at Veritone.

Veritone is an artificial intelligence (AI) company based in Irvine, California. Founded in 2014, Veritone empowers people with AI-powered software and solutions for various applications, including media processing, analytics, advertising, and more. It offers solutions for media transcription, facial recognition, content summarization, object detection, and other AI capabilities to solve the unique challenges professionals face across industries.

Veritone began its journey with its foundational AI operating system, aiWARE™, solving industry and brand-specific challenges by building applications on top of this powerful technology. Growing in the media and entertainment space, Veritone solves media management, broadcast content, and ad tracking issues. Alongside these applications, Veritone offers media services including AI-powered audio advertising and influencer marketing, content licensing and media monetization services, and professional services to build bespoke AI solutions.

With a decade of enterprise AI experience, Veritone supports the public sector, working with US federal government agencies, state and local government, law enforcement agencies, and legal organizations to automate and simplify evidence management, redaction, person-of-interest tracking, and eDiscovery. Veritone has also expanded into the talent acquisition space, serving HR teams worldwide with its powerful programmatic job advertising platform and distribution network.

Using generative AI and new multimodal foundation models (FMs) could be very strategic for Veritone and the businesses they serve, because it would significantly improve media indexing and retrieval based on contextual meaning—a critical first step to eventually generating new content. Building enhanced semantic search capabilities that analyze media contextually would lay the groundwork for creating AI-generated content, allowing customers to produce customized media more efficiently.

Veritone’s current media search and retrieval system relies on keyword matching of metadata generated from ML services, including information related to faces, sentiment, and objects. With recent advances in large language models (LLMs), Veritone has updated its platform with these powerful new AI capabilities. Looking ahead, Veritone wants to take advantage of new advanced FM techniques to improve the quality of media search results of “Digital Media Hub” (DMH) and grow the number of users by achieving a better user experience.

In this post, we demonstrate how to use enhanced video search capabilities by enabling semantic retrieval of videos based on text queries. We match the most relevant videos to text-based search queries by incorporating new multimodal embedding models like Amazon Titan Multimodal Embeddings to encode all visual, visual-meta, and transcription data. The primary focus is building a robust text search that goes beyond traditional word-matching algorithms as well as an interface for comparing search algorithms. Additionally, we explore narrowing retrieval to specific shots within videos (a shot is a series of interrelated consecutive pictures taken contiguously by a single camera representing a continuous action in time and space). Overall, we aim to improve video search through cutting-edge semantic matching, providing an efficient way to find videos relevant to your rich textual queries.

Solution overview

We use the following AWS services to implement the solution:

Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral, Stability AI, and Amazon within a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

The current architecture consists of three components:

  • Metadata generation – This component generates metadata from a video archive, processes it, and creates embeddings for search indexing. The videos from Amazon S3 are retrieved and converted to the H.264 video codec using the FFmpeg library. The processed videos are sent to AWS services like Amazon Rekognition, Amazon Transcribe, and Amazon Comprehend to generate metadata at the shot level and video level. We use the Amazon Titan Text and Multimodal Embeddings models to embed the metadata and the video frames and index them in OpenSearch Service. We use AWS Step Functions to orchestrate the entire pipeline.
  • Search – A UI-based video search pipeline takes in the user query as input and retrieves relevant videos. The user query invokes a Lambda function. Based on the search method selected, you either perform a text- or keyword-based search or an embedding-based search. The search body is sent to OpenSearch Service to retrieve video results at the shot level, which is displayed to the user.
  • Evaluation – The UI enables you to perform qualitative evaluation against different search settings. You enter a query and, based on the search settings, video results are retrieved from OpenSearch. You can view the results and provide feedback by voting for the winning setting.

The following diagram illustrates the solution architecture.

The high-level takeaways from this work are the following:

  • Using an Amazon Rekognition API to detect shots and index them achieved better retrieving recall (at least 50% improvement) than performing the same on the video level
  • Incorporating the Amazon Titan Text Embeddings model to semantically retrieve the video results instead of using raw text generated by Amazon Rekognition and Amazon Transcribe boosted the recall performance by 52%
  • The Amazon Titan Multimodal Embeddings model showed high capability to encode visual information of video image frames and achieved the best performance when combined with text embeddings of Amazon Rekognition and Amazon Transcribe text metadata, improving on baseline metrics by up to three times
  • The A/B evaluation UI that we developed to test new search methods and features proved to be effective

Detailed quantitative analysis of these conclusions is discussed later in this post.

Metadata generation pipeline

The video metadata generation pipeline consists of processing video files using AWS services such as Amazon Transcribe, Amazon Rekognition, and Amazon Comprehend, as shown in the following diagram. The metadata is generated at the shot level for a video.

In this section, we discuss the details of each service and the workflow in more detail.

Amazon Transcribe

The transcription for the entire video is generated using the StartTranscriptionJob API. When the job is complete, you can obtain the raw transcript data using GetTranscriptionJob. The GetTranscriptionJob returns a TranscriptFileUri, which can be processed to get the speakers and transcripts based on a timestamp. The file formats supported by Amazon Transcribe are AMR, FLAC (recommended), M4A, MP3, MP4, Ogg, WebM, and WAV (recommended).

The raw transcripts are further processed to be stored using timestamps, as shown in the following example.

Amazon Rekognition

Amazon Rekognition requires the video to be encoded using the H.264 codec and formatted to either MPEG-4 or MOV. We used FFmpeg to format the videos in Amazon S3 to the required vcodec. FFmpeg is a free and open-source software project in the form of a command line tool designed for processing video, audio, and other multimedia files and streams. Python provides a wrapper library around the tool called ffmpeg-python.

The solution runs Amazon Rekognition APIs for label detection, text detection, celebrity detection, and face detection on videos. The metadata generated for each video by the APIs is processed and stored with timestamps. The videos are then segmented into individual shots. With Amazon Rekognition, you can detect the start, end, and duration of each shot as well as the total shot count for a content piece. The video shot detection job starts with the StartSegmentDetection API, which returns a jobId that can be used to monitor status with the GetSegmentDetection API. When the video segmentation status changes to Succeeded, for each shot, you parse the previously generated Amazon Rekognition API metadata using the shot’s timestamp. You then append this parsed metadata to the shot record. Similarly, the full transcript from Amazon Transcribe is segmented using the shot start and end timestamps to create shot-level transcripts.
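The following is a minimal sketch of the shot detection flow described above, using the Rekognition segment APIs; the bucket, object key, and polling cadence are illustrative, and production pipelines would rely on Step Functions or SNS notifications rather than polling.

import time

import boto3

rekognition = boto3.client("rekognition")

# Start shot-level segment detection on an H.264 MP4 stored in S3 (names are placeholders)
start = rekognition.start_segment_detection(
    Video={"S3Object": {"Bucket": "example-media-bucket", "Name": "videos/sample.mp4"}},
    SegmentTypes=["SHOT"],
)
job_id = start["JobId"]

# Poll until the job finishes
while True:
    result = rekognition.get_segment_detection(JobId=job_id)
    if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(10)

# Each shot carries start/end timestamps used to slice the other metadata per shot
for segment in result.get("Segments", []):
    if segment["Type"] == "SHOT":
        print(
            segment["ShotSegment"]["Index"],
            segment["StartTimestampMillis"],
            segment["EndTimestampMillis"],
            segment["DurationMillis"],
        )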

Amazon Comprehend

The temporal transcripts are then processed by Amazon Comprehend to detect entities and sentiments using the DetectEntities, DetectSentiment, and DetectTargetedSentiment APIs, producing additional shot-level metadata for each video.
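A minimal sketch of these Comprehend calls on a shot-level transcript string follows; the transcript text is a placeholder and the result handling is illustrative, while the request parameters are the documented API fields.

import boto3

comprehend = boto3.client("comprehend")

# Placeholder shot-level transcript snippet
shot_transcript = "Speaker 0: The new broadband package launches next month in Colombo."

entities = comprehend.detect_entities(Text=shot_transcript, LanguageCode="en")
sentiment = comprehend.detect_sentiment(Text=shot_transcript, LanguageCode="en")
targeted = comprehend.detect_targeted_sentiment(Text=shot_transcript, LanguageCode="en")

shot_metadata = {
    "entities": [(e["Text"], e["Type"]) for e in entities["Entities"]],
    "sentiment": sentiment["Sentiment"],
    # Keep only high-confidence targeted sentiment mentions (0.9 mirrors the processing step below)
    "targeted_sentiment": [
        (m["Text"], m["MentionSentiment"]["Sentiment"])
        for ent in targeted["Entities"]
        for m in ent["Mentions"]
        if max(m["MentionSentiment"]["SentimentScore"].values()) > 0.9
    ],
}
print(shot_metadata)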

Metadata processing

The shot-level metadata generated by the pipeline is processed to stage it for embedding generation. The goal of this processing is to aggregate useful information and remove null or less significant information that wouldn’t add value for embedding generation.

The processing algorithm is as follows:

rekognition_metadata
  - shot_metadata: extract StartFrameNumber and EndFrameNumber
  - celeb_metadata: extract celebrity metadata
  - label_metadata: extract unique labels
  - text_metadata: extract unique text labels if there are more than 3 words (the raw output is noisy, with "-", "null", and other values)
  - face_analysis_metadata: extract the unique list of AgeRange, Emotions, and Gender
  - combine all Amazon Rekognition text data into a single rek_text_metadata string
transcribe_metadata
  - transcribe_metadata: check the word count of the conversation across all speakers; if it is more than 50 words, mark it for a summarization task with Amazon Bedrock
comprehend_metadata
  - comprehend_metadata: extract sentiment
  - comprehend_metadata: extract targeted sentiment scores for words with score > 0.9

Large transcript summarization

Large transcripts from the processed metadata are summarized with the Anthropic Claude 2 model. After summarizing the transcript, we extract the names of the key characters mentioned in the summary as well as the important keywords.
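The following is a minimal sketch of summarizing a long transcript with Anthropic Claude 2 on Amazon Bedrock; the prompt wording and generation parameters are assumptions, not Veritone’s actual prompt.

import json

import boto3

bedrock = boto3.client("bedrock-runtime")


def summarize_transcript(transcript: str) -> str:
    # Claude 2 uses the Human/Assistant prompt format
    prompt = (
        "\n\nHuman: Summarize the following video transcript in a short paragraph, "
        "then list the key characters mentioned and the most important keywords.\n\n"
        f"{transcript}\n\nAssistant:"
    )
    response = bedrock.invoke_model(
        modelId="anthropic.claude-v2",
        body=json.dumps({"prompt": prompt, "max_tokens_to_sample": 500, "temperature": 0.2}),
    )
    return json.loads(response["body"].read())["completion"]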

Embeddings generation

In this section, we discuss the details for generating shot-level and video-level embeddings.

Shot-level embeddings

We generate two types of embeddings: text and multimodal. To understand which metadata and service contributes to the search performance and by how much, we create a varying set of embeddings for experimental analysis.

We implement the following with Amazon Titan Multimodal Embeddings (a minimal sketch of the underlying embedding calls follows these lists):

  • Embed image:
    • TMM_shot_img_embs – We sample the middle frame from every shot and embed them. We assume the middle frame in the shot captures the semantic nuance in the entire shot. You can also experiment with embedding all the frames and averaging them.
    • TMM_rek_text_shot_emb – We sample the middle frame from every shot and embed it along with Amazon Rekognition text data.
    • TMM_transcribe_shot_emb – We sample the middle frame from every shot and embed it along with Amazon Transcribe text data.
  • Embed text (to compare if the text data is represented well with the LLM or multimodal model, we also embed them with Amazon Titan Multimodal):
    • TMM_rek_text_emb – We embed the Amazon Rekognition text as multimodal embeddings without the images.
    • TMM_transcribe_emb – We embed the Amazon Transcribe text as multimodal embeddings without the images.

We implement the following with the Amazon Titan Text Embeddings model:

  • Embed text:
    • TT_rek_text_emb – We embed the Amazon Rekognition text as text embeddings
    • TT_transcribe_emb – We embed the Amazon Transcribe text as text embeddings
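
The following is a minimal sketch of generating one shot-level embedding of each type with Amazon Bedrock (the frame path and sample metadata text are illustrative):

import base64
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def embed(model_id: str, body: dict) -> list:
    """Invoke a Titan embeddings model and return the embedding vector."""
    response = bedrock_runtime.invoke_model(modelId=model_id, body=json.dumps(body))
    return json.loads(response["body"].read())["embedding"]

# Middle frame of the shot, base64-encoded for the multimodal model
with open("frames/shot_042_middle.jpg", "rb") as f:
    frame_b64 = base64.b64encode(f.read()).decode("utf-8")

rek_text = "person, beach, sunset"           # Amazon Rekognition metadata for the shot
transcript_text = "Let's watch the sunset."  # Amazon Transcribe text for the shot

# Multimodal embeddings: image only, image + Rekognition text, image + transcript
tmm_shot_img_emb = embed("amazon.titan-embed-image-v1", {"inputImage": frame_b64})
tmm_rek_text_shot_emb = embed(
    "amazon.titan-embed-image-v1", {"inputImage": frame_b64, "inputText": rek_text}
)
tmm_transcribe_shot_emb = embed(
    "amazon.titan-embed-image-v1", {"inputImage": frame_b64, "inputText": transcript_text}
)

# Text embeddings of the same metadata with the Titan Text Embeddings model
tt_rek_text_emb = embed("amazon.titan-embed-text-v1", {"inputText": rek_text})
tt_transcribe_emb = embed("amazon.titan-embed-text-v1", {"inputText": transcript_text})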

Video-level embeddings

If a video has only one shot (a small video capturing a single action), the embeddings will be the same as shot-level embeddings.

For videos that have more than one shot, we implement the following using the Amazon Titan Multimodal Embeddings Model:

  • Embed image:
    • TMM_shot_img_embs – We sample K images with replacement across all the shot-level metadata, generate embeddings, and average them
    • TMM_rek_text_shot_emb – We sample K images with replacement across all the shot-level metadata, embed it along with Amazon Rekognition text data, and average them.
    • TMM_transcribe_shot_emb – We sample K images with replacement across all the shot-level metadata, embed it along with Amazon Transcribe text data, and average them
  • Embed text:
    • TMM_rek_text_emb – We combine all the Amazon Rekognition text data and embed it as multimodal embeddings without the images
    • TMM_transcribe_emb – We combine all the Amazon Transcribe text data and embed it as multimodal embeddings without the images

We implement the following using the Amazon Titan Text Embeddings model:

  • Embed text:
    • TT_rek_text_emb – We combine all the Amazon Rekognition text data and embed it as text embeddings
    • TT_transcribe_emb – We combine all the Amazon Transcribe text data and embed it as text embeddings
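
A minimal sketch of the video-level multimodal embedding described above, sampling K frames with replacement and averaging their embeddings (K is illustrative, and the embed helper from the earlier sketch is reused):

import base64
import random

import numpy as np

def encode_frame(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def video_level_multimodal_embedding(shot_frames: list, k: int = 8) -> list:
    """Sample K frames with replacement across all shots, embed each with the
    Titan Multimodal Embeddings model, and average them into one vector."""
    sampled = random.choices(shot_frames, k=k)  # sampling with replacement
    embeddings = [
        embed("amazon.titan-embed-image-v1", {"inputImage": encode_frame(path)})
        for path in sampled
    ]
    return np.mean(np.array(embeddings), axis=0).tolist()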

Search pipeline

In this section, we discuss the components of the search pipeline.

Search index creation

We use an OpenSearch cluster (OpenSearch Service domain) with t3.medium.search to store and retrieve indexes for our experimentation with text, knn_vector, and Boolean fields indexed. We recommend exploring Amazon OpenSearch Serverless for production deployment for indexing and retrieval. OpenSearch Serverless can index billions of records and has expanded its auto scaling capabilities to efficiently handle tens of thousands of query transactions per minute.

The following screenshots are examples of the text, Boolean, and embedding fields that we created.
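
The following is a minimal sketch of how such an index could be created with the opensearch-py client (the endpoint, credentials, index name, and embedding dimensions are illustrative):

from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "<your-domain-endpoint>", "port": 443}],
    http_auth=("<user>", "<password>"),
    use_ssl=True,
)

index_body = {
    "settings": {"index": {"knn": True}},  # enable k-NN search on this index
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "transcribe_metadata": {"type": "text"},
            "transcribe_summary": {"type": "text"},
            "rek_text_metadata": {"type": "text"},
            "is_multi_shot": {"type": "boolean"},
            "TMM_shot_img_embs": {"type": "knn_vector", "dimension": 1024},
            "TT_transcribe_emb": {"type": "knn_vector", "dimension": 1536},
        }
    },
}

client.indices.create(index="video-search-index", body=index_body)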

Query flow

The following diagram illustrates the query workflow.

A user query can be used to retrieve video records through either text-based or semantic (embedding-based) search.

For text-based retrieval, we use the search query as input to retrieve results from OpenSearch Service using the search fields transcribe_metadata, transcribe_summary, transcribe_keyword, transcribe_speakers, and rek_text_metadata:

OpenSearch Input

search_fields=[
    "transcribe_metadata",
    "transcribe_summary",
    "transcribe_keyword",
    "transcribe_speakers",
    "rek_text_metadata"
]
search_body = { 
   "query": { 
      "multi_match": { 
          "query": search_query, 
          "fields": search_fields 
      } 
   } 
}

For semantic retrieval, the query is embedded using the amazon.titan-embed-text-v1 or amazon.titan-embed-image-v1 model, and the resulting embedding is used as input to retrieve results from OpenSearch Service, querying the vector field that corresponds to the metadata embedding of choice:

OpenSearch Input

search_body = {
        "size": <number of top results>,
        "fields": ["name"],
        "query": {
            "knn": {
                vector_field: {"vector": <embedding>, "k": <length of embedding>}
            }
       },
}

Search results combination

Exact match and semantic search have their own benefits depending on the application. Users who search for a specific celebrity or movie name would benefit from an exact match search, whereas users looking for thematic queries like “summer beach vibes” and “candlelit dinner” would find semantic search results more applicable. To enable the best of both, we combine the results from both types of searches. Additionally, different embeddings could capture different semantics (for example, Amazon Transcribe text embedding vs. image embedding with a multimodal model). Therefore, we also explore combining different semantic search results.

To combine search results from different search methods and different score ranges, we used the following logic:

  1. Normalize the scores from each results list independently to a common 0–1 range using rank_norm.
  2. Sum the weighted normalized scores for each result video from all the search results.
  3. Sort the results based on the score.
  4. Return the top K results.

We use the rank_norm method, where the score is calculated based on the rank of each video in the list. The following is the Python implementation of this method:

def rank_norm(results):
    """Assign rank-based scores in [0, 1] to an ordered {doc_id: score} dict."""
    n_results = len(results)
    normalized_results = {}
    # results is ordered best-first, so the item at rank i gets score 1 - i/n
    for i, doc_id in enumerate(results.keys()):
        normalized_results[doc_id] = 1 - (i / n_results)
    ranked_normalized_results = sorted(
        normalized_results.items(), key=lambda x: x[1], reverse=True
    )
    return dict(ranked_normalized_results)
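
Reusing rank_norm, the weighted combination step can be sketched as follows (the weights are illustrative and would be tuned per use case):

def combine_results(results_lists, weights, top_k=10):
    """Weighted fusion of multiple ranked result lists.

    results_lists: list of {doc_id: score} dicts, each ordered best-first
    weights: one weight per results list
    """
    combined = {}
    for results, weight in zip(results_lists, weights):
        normalized = rank_norm(results)
        for doc_id, score in normalized.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * score
    ranked = sorted(combined.items(), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

# Example: fuse exact-match and semantic results, weighting semantic search higher
# combine_results([text_results, semantic_results], weights=[0.4, 0.6])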

Evaluation pipeline

In this section, we discuss the components of the evaluation pipeline.

Search and evaluation UI

The following diagram illustrates the architecture of the search and evaluation UI.

The UI webpage is hosted in an S3 bucket and served through an Amazon CloudFront distribution. The current approach uses an API key for authentication; this could be enhanced by using Amazon Cognito and registering users. The user can perform two actions on the webpage:

  • Search – Enter the query to retrieve video content
  • Feedback – Based on the results displayed for a query, vote for the winning method

We create two API endpoints using Amazon API Gateway: GET /search and POST /feedback. The following screenshot illustrates our UI with two retrieval methods that have been anonymized for the user for a bias-free evaluation.

GET /search

We pass two QueryStringParameters with this API call:

  • query – The user input query
  • method – The method the user is evaluating

This API is created with a proxy integration that invokes a Lambda function. The Lambda function processes the query and, based on the method used, retrieves results from OpenSearch Service. The results are then processed to retrieve videos from the S3 bucket and are displayed on the webpage. In the search UI, we use a specific method (search setting) to retrieve results:

Request
?query=<>&method=<>

Response

{
    "results": [
        {"name": <video-name>, "score": <score>}, 
        {"name": <video-name>, "score": <score>},
        ...
    ]
}

The following is a sample request:

?query=candlelit dinner&method=MethodB

The following screenshot shows our results.
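
The Lambda function behind this endpoint could look like the following sketch (search_opensearch is an illustrative helper that wraps the OpenSearch queries shown earlier):

import json

def lambda_handler(event, context):
    """Proxy-integration handler for GET /search."""
    params = event.get("queryStringParameters") or {}
    query = params.get("query", "")
    method = params.get("method", "MethodA")

    # Retrieve ranked videos from OpenSearch Service for the chosen method
    results = search_opensearch(query=query, method=method)

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"results": results}),
    }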

POST /feedback

Given a query, each method displays its retrieved video content and video names on the webpage. Based on the relevance of the results, the user can vote on whether a particular method performs better than the other (win or lose) or whether the methods are tied. The API has a proxy integration with a Lambda function, which stores these results in an S3 bucket. In the evaluation UI, you can analyze the method search results to find the best search configuration setting. The request body uses the following syntax:

Request Body

{
    "result": <winning method>,
    "searchQuery": <query>,
    "sessionId": <current-session-id>,
    "Method<>": {
        "methodType": <type of method used>,
        "results": "[{\"name\": <video-name>, \"score\": <score>}]"
    },
    "Method<>": {
        "methodType": <type of method used>,
        "results": "[{\"name\": \"1QT426_s01\", \"score\": 1.5053753}]"
    }
}

The following screenshot shows a sample request.

Experiments and results

In this section, we discuss the datasets used in our experiments and the quantitative and qualitative evaluations based on the results.

Short videos dataset

This dataset includes 500 videos with an average length of 20 seconds. Each video has manually written metadata such as keywords and descriptions. In general, the videos in this dataset are related to travel, vacation, and restaurant topics.

The majority of videos are less than 20 seconds and the maximum is 400 seconds, as illustrated in the following figure.

Long videos dataset

The second dataset has 300 high-definition videos with a video length ranging from 20–160 minutes, as illustrated in the following figure.

Quantitative evaluation

We use the following metrics in our quantitative evaluation:

  • Mean reciprocal rank – Mean reciprocal rank (MRR) measures the inverse of the position of the most relevant item in the search results.
  • Recall@topK – We measure recall at top K as the percentage of correctly retrieved videos out of the desired video search results (ground truth). For example:

A, B, C are relevant (ground truth)
A, D, N, M, G are the top K retrieved videos
Recall@top5 = 1/3

We compute these metrics using a ground truth dataset provided by Veritone that had mappings of search query examples to relevant video IDs.
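
A minimal sketch of computing these metrics per query (retrieved lists are ordered best-first; the function names are illustrative):

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, video_id in enumerate(retrieved, start=1):
        if video_id in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant videos found in the top K results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

# Example from above: ground truth {A, B, C}, top 5 retrieved [A, D, N, M, G]
relevant = {"A", "B", "C"}
retrieved = ["A", "D", "N", "M", "G"]
print(reciprocal_rank(retrieved, relevant))   # 1.0 (A is ranked first)
print(recall_at_k(retrieved, relevant, k=5))  # 0.333... (1 of 3 relevant videos)

MRR is then the mean of the reciprocal ranks across all evaluation queries.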

The following table summarizes the top three retrieval methods from the long videos dataset (% improvement over baseline).

Methods | Video Level: MRR vs. Video-Level Baseline MRR | Shot Level: MRR vs. Video-Level Baseline MRR | Video Level: Recall@top10 vs. Video-Level Baseline Recall@top10 | Shot Level: Recall@top10 vs. Video-Level Baseline Recall@top10
Raw Text: Amazon Transcribe + Amazon Rekognition | Baseline | N/A | Baseline | N/A
Semantic: Amazon Transcribe + Amazon Rekognition | 0.84% | 52.41% | 19.67% | 94.00%
Semantic: Amazon Transcribe + Amazon Rekognition + Amazon Titan Multimodal | 37.31% | 81.19% | 71.00% | 93.33%
Semantic: Amazon Transcribe + Amazon Titan Multimodal | 15.56% | 58.54% | 61.33% | 121.33%

The following are our observations on the MRR and recall results:

  • Overall shot-level retrieval outperforms the video-level retrieval baseline across both MRR and recall metrics.
  • Raw text has lower MRR and recall scores than embedding-based search on both video and shot level. All three semantic methods show improvement in MRR and recall.
  • Combining semantic (Amazon Transcribe + Amazon Rekognition + Amazon Titan Multimodal) yields the best improvement across video MRR, shot MRR, and video recall metrics.

The following table summarizes the top three retrieval methods from the short videos dataset (% improvement over baseline).

Methods | Video Level: MRR vs. Video-Level Baseline MRR | Shot Level: MRR vs. Video-Level Baseline MRR | Video Level: Recall@top10 vs. Video-Level Baseline Recall@top10 | Shot Level: Recall@top10 vs. Video-Level Baseline Recall@top10
Raw Text: Amazon Transcribe + Amazon Rekognition | Baseline | N/A | Baseline | N/A
Semantic: Amazon Titan Multimodal | 226.67% | 226.67% | 373.57% | 382.61%
Semantic: Amazon Transcribe + Amazon Rekognition + Amazon Titan Multimodal | 100.00% | 60.00% | 299.28% | 314.29%
Semantic: Amazon Transcribe + Amazon Titan Multimodal | 53.33% | 53.33% | 307.21% | 312.77%

We made the following observations on the MRR and recall results:

  • Encoding the videos using the Amazon Titan Multimodal Embeddings model alone yields the best result, compared to adding Amazon Transcribe, Amazon Transcribe + Amazon Rekognition, or Amazon Transcribe + Amazon Rekognition + Amazon Titan Multimodal Embeddings (due to the lack of dialogue and scene changes in these short videos).
  • All semantic retrieval methods (2, 3, and 4) show at least a 53% improvement over the baseline.
  • Although Amazon Titan Multimodal Embeddings alone works well for this data, other metadata such as Amazon Transcribe output, Amazon Rekognition output, and pre-existing human labels can be combined with Amazon Titan Multimodal Embeddings to improve performance, depending on the nature of the data.

Qualitative evaluation

We evaluated the quantitative results from our pipeline to find matches with the ground truth shared by Veritone. However, there could be other relevant videos in the retrieved results from our pipeline that are not part of the ground truth, which could further improve some of these metrics. Therefore, to qualitatively evaluate our pipeline, we used an A/B testing framework, where a user can view results from two anonymized methods (the metadata used by the method is not exposed to reduce any bias) and rate which results were more aligned with the query entered.

The aggregated results across the method comparisons were used to calculate the win rate and select the final embedding method for the search pipeline.

The following methods were shortlisted based on Veritone’s interest, in order to limit the number of pairwise comparisons.

Method Name (Exposed to User) | Retrieval Type (Not Exposed to User)
Method E | Semantic Amazon Transcribe retrieval results only
Method F | Fusion of semantic Amazon Transcribe + Amazon Titan Multimodal retrieval results
Method G | Fusion of semantic Amazon Transcribe + semantic Amazon Rekognition + Amazon Titan Multimodal retrieval results

The following tables summarize the voting results and winning rates.

 

Experiment | Method E Wins | Method F Wins | Tie
Method E vs. Method F | 10% | 85% | 5%

Experiment | Method F Wins | Method G Wins | Tie
Method F vs. Method G | 30% | 60% | 10%

Based on the results, we see that adding Amazon Titan Multimodal Embeddings to the transcription method (Method F) is better than just using semantic transcription retrieval (Method E). Adding Amazon Rekognition based retrieval results (Method G) improves over Method F.

Takeaways

We had the following key takeaways:

  • Enabling vector search indexing and retrieval, instead of relying only on text matching against AI-generated text metadata, improves search recall.
  • Indexing and retrieving videos at the shot level can boost performance and improve customer experience. Users can efficiently find precise clips matching their query rather than sifting through entire videos.
  • Multimodal representation of queries and metadata through models trained on both images and text performs better than single-modality representation from models trained on text alone.
  • The fusion of text and visual cues significantly improves search relevance by capturing semantic alignments between queries and clips more accurately and better capturing the user’s search intent.
  • Enabling direct human comparison between retrieval models through A/B testing allows for inspecting and selecting the optimal approach. This can boost the confidence to ship new features or search methods to production.

Security best practices

We recommend following AWS security best practices for the services used in this solution, such as granting least-privilege IAM permissions and encrypting data at rest and in transit.

Conclusion

In this post, we showed how Veritone upgraded their classical search pipelines with Amazon Titan Multimodal Embeddings in Amazon Bedrock through a few API calls. We showed how videos can be indexed in different representations, text vs. text embeddings vs. multimodal embeddings, and how they can be analyzed to produce a robust search based on the data characteristics and use case.

If you are interested in working with the AWS Generative AI Innovation Center, please reach out to the GenAIIC.


About the Authors

Tim Camara is a Senior Product Manager on the Digital Media Hub team at Veritone. With over 15 years of experience across a range of technologies and industries, he’s focused on finding ways to use emerging technologies to improve customer experiences.

Mohamad Al Jazaery is an Applied Scientist at the Generative AI Innovation Center. As a scientist and tech lead, he helps AWS customers envision and build GenAI solutions to address their business challenges in different domains such as Media and Entertainment, Finance, and Lifestyle.


Meghana Ashok is a Machine Learning Engineer at the Generative AI Innovation Center. She collaborates closely with customers, guiding them in developing secure, cost-efficient, and resilient solutions and infrastructure tailored to their generative AI needs.

Divya Bhargavi is a Senior Applied Scientist Lead at the Generative AI Innovation Center, where she solves high-value business problems for AWS customers using generative AI methods. She works on image/video understanding and retrieval, knowledge graph augmented large language models, and personalized advertising use cases.

Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he uses his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.

Read More

Information extraction with LLMs using Amazon SageMaker JumpStart

Information extraction with LLMs using Amazon SageMaker JumpStart

Large language models (LLMs) have unlocked new possibilities for extracting information from unstructured text data. Although much of the current excitement is around LLMs for generative AI tasks, many of the key use cases that you might want to solve have not fundamentally changed. Tasks such as routing support tickets, recognizing customer intents from a chatbot conversation session, extracting key entities from contracts, invoices, and other types of documents, and analyzing customer feedback are examples of long-standing needs.

What makes LLMs so transformative, however, is their ability to achieve state-of-the-art results on these common tasks with minimal data and simple prompting, and their ability to multitask. Rather than requiring extensive feature engineering and dataset labeling, LLMs can be fine-tuned on small amounts of domain-specific data to quickly adapt to new use cases. By handling most of the heavy lifting, services like Amazon SageMaker JumpStart remove the complexity of fine-tuning and deploying these models.

SageMaker JumpStart is a machine learning (ML) hub with foundation models (FMs), built-in algorithms, and prebuilt ML solutions that you can deploy with just a few clicks. With SageMaker JumpStart, you can evaluate, compare, and select FMs quickly based on predefined quality and responsibility metrics to perform tasks like article summarization and image generation.

This post walks through examples of building information extraction use cases by combining LLMs with prompt engineering and frameworks such as LangChain. We also examine the uplift from fine-tuning an LLM for a specific extractive task. Whether you’re looking to classify documents, extract keywords, detect and redact personally identifiable information (PIIs), or parse semantic relationships, you can start ideating your use case and use LLMs for your natural language processing (NLP).

Prompt engineering

Prompt engineering enables you to instruct LLMs to generate suggestions, explanations, or completions of text in an interactive way. Prompt engineering relies on large pretrained language models that have been trained on massive amounts of text data. At first glance, there might not be one best way to design a prompt, and different LLMs might work better or worse with different prompts. Therefore, prompts are often iteratively refined through trial and error to produce better results. As a starting point, you can refer to the model documentation which typically includes recommendations and best practices for prompting the model, and examples provided in SageMaker JumpStart.

In the following sections, we focus on the prompt engineering techniques required for extractive use cases. They help unlock the power of LLMs by providing helpful constraints and guide the model toward its intended behavior. We discuss the following use cases:

  • Sensitive information detection and redaction
  • Entity extraction; generic and specific entities with structured formats
  • Classification, using prompt engineering and fine-tuning

Before we explore these use cases, we need to set up our development environment.

Prerequisites

The source code accompanying this example is available in this GitHub repo. It consists of several Jupyter notebooks and a utils.py module. The utils.py module houses the shared code that is used throughout the notebooks.

The simplest way to run this example is by using Amazon SageMaker Studio with the Data Science 3.0 kernel or an Amazon SageMaker notebook instance with the conda_python3 kernel. For the instance type, you can choose the default settings.

In this example, we use ml.g5.2xlarge and ml.g5.48xlarge instances for endpoint usage, and ml.g5.24xlarge for training job usage. Use the Service Quotas console to make sure you have sufficient quotas for these instances in the Region where you’re running this example.

We use Jupyter notebooks throughout this post. Before we explore the examples, it’s crucial to confirm that you have the latest version of the SageMaker Python SDK. This SDK offers a user-friendly interface for training and deploying models on SageMaker. To install or upgrade to the latest version, run the following command in the first cell of your Jupyter notebook:

%pip install --quiet --upgrade sagemaker

Deploy Llama-2-70b-chat using SageMaker JumpStart

There are many LLMs available in SageMaker JumpStart to choose from. In this example, we use Llama-2-70b-chat, but you might use a different model depending on your use case. To explore the list of SageMaker JumpStart models, see JumpStart Available Model Table.

To deploy a model from SageMaker JumpStart, you can use either APIs, as demonstrated in this post, or use the SageMaker Studio UI. After the model is deployed, you can test it by asking a question from the model:

from sagemaker.jumpstart.model import JumpStartModel

model_id, model_version = "meta-textgeneration-llama-2-70b-f", "2.*"
endpoint_name = model_id
instance_type = "ml.g5.48xlarge"

# role_arn is the ARN of your SageMaker execution role
model = JumpStartModel(
    model_id=model_id, model_version=model_version, role=role_arn
)
predictor = model.deploy(
    endpoint_name=endpoint_name, instance_type=instance_type
)

If no instance_type is provided, the SageMaker JumpStart SDK will select the default type. In this example, you explicitly set the instance type to ml.g5.48xlarge.

Sensitive data extraction and redaction

LLMs show promise for extracting sensitive information for redaction. Prompt engineering techniques help here, such as priming the model to understand the redaction task and providing examples that can improve performance. For example, priming the model by stating “redact sensitive information” and demonstrating a few examples of redacting names, dates, and locations can help the LLM infer the rules of the task.

More in-depth forms of priming the model include providing positive and negative examples, demonstrations of common errors, and in-context learning to teach the nuances of proper redaction. With careful prompt design, LLMs can learn to redact information while maintaining readability and utility of the document. In real-life applications, however, additional evaluation is often necessary to improve the reliability and safety of LLMs for handling confidential data. This is often achieved through the inclusion of human review, because no automated approach is entirely foolproof.

The following are a few examples of using prompt engineering for the extraction and redaction of PII. The prompt consists of multiple parts: the report_sample, which includes the text that you want to identify and mask the PII data within, and instructions (or guidance) passed on to the model as the system message.

report_sample = """
This month at AnyCompany, we have seen a significant surge in orders from a diverse clientele. On November 5th, 2023, customer Alice from US placed an order with total of $2190. Following her, on Nov 7th, Bob from UK ordered a bulk set of twenty-five ergonomic keyboards for his office setup with total of $1000. The trend continued with Jane from Australia, who on Nov 12th requested a shipment of ten high-definition monitors with total of $9000, emphasizing the need for environmentally friendly packaging. On the last day of that month, customer John, located in Singapore, finalized an order for fifteen USB-C docking stations, aiming to equip his design studio with the latest technology for total of $3600.
"""

system = """
Your task is to precisely identify Personally Identifiable Information (PII) and identifiable details, including name, address, and the person's country, in the provided text. Replace these details with exactly four asterisks (****) as the masking characters. Use '****' for masking text of any length. Only write the masked text in the response.
"""

In the following example, you define the llama2_chat function that encapsulates sending the prompt to the Llama-2 model. You reuse this function throughout the examples.

def llama2_chat(
    predictor,
    user,
    temperature=0.1,
    max_tokens=512,
    top_p=0.9,
    system=None,
):
    """Constructs the payload for the llama2 model, sends it to the endpoint,
    and returns the response."""

    inputs = []
    if system:
        inputs.append({"role": "system", "content": system})
    if user:
        inputs.append({"role": "user", "content": user})

    payload = {
        "inputs": [inputs],
        "parameters": {
            "max_new_tokens": max_tokens,
            "top_p": top_p,
            "temperature": temperature,
        },
    }
    response = predictor.predict(payload, custom_attributes="accept_eula=true")
    return response

Use the following code to call the function, passing your parameters:

response = utils.llama2_chat(
    predictor,
    system=system,
    user=report_sample,
)
print(utils.llama2_parse_output(response))

You get the following output:

This month at AnyCompany, we have seen a significant surge in orders from a diverse clientele. On November 5th, 2023, customer ***** from ***** placed an order with total of $2190. Following her, on Nov 7th, ***** from ***** ordered a bulk set of twenty-five ergonomic keyboards for his office setup with total of $1000. The trend continued with ***** from *****, who on Nov 12th requested a shipment of ten high-definition monitors with total of $9000, emphasizing the need for environmentally friendly packaging. On the last day of that month, customer *****, located in *****, finalized an order for fifteen USB-C docking stations, aiming to equip his design studio with the latest technology for total of $3600.

Entity extraction

Entity extraction is the process of identifying and extracting key information entities from unstructured text. This technique helps create structured data from unstructured text and provides useful contextual information for many downstream NLP tasks. Common applications for entity extraction include building a knowledge base, extracting metadata to use for personalization or search, and improving user inputs and conversation understanding within chatbots.

You can effectively use LLMs for entity extraction tasks through careful prompt engineering. With a few examples of extracting entities from text, explanatory prompts, and the desired output format, the model can learn to identify and extract entities such as people, organizations, and locations from new input texts. In the following examples, we demonstrate a few different entity extraction tasks ranging from simpler to more complex using prompt engineering with the Llama-2-70b-chat model you deployed earlier.

Extract generic entities

Use the following code to extract specific entities:

email_sample = "Hello, My name is John. Your AnyCompany Financial Services, LLC credit card account 1111-0000-1111-0008 has a minimum payment of $24.53 that is due by July 31st. Based on your autopay settings, we will withdraw your payment on the due date from your bank account number XXXXXX1111 with the routing number XXXXX0000. Customer feedback for Sunshine Spa, 123 Main St, Anywhere. Send comments to Alice at alice_aa@anycompany.com and Bob at bob_bb@anycompany.com. I enjoyed visiting the spa. It was very comfortable but it was also very expensive. The amenities were ok but the service made the spa a great experience."

system = """
Your task is to precisely identify any email addresses from the given text and then write them, one per line. Remember to ONLY write an email address if it's precisely spelled out in the input text. If there are no email addresses in the text, write "N/A". DO NOT write anything else.
"""

result = utils.llama2_chat(predictor, system=system, user=email_sample)
print(utils.llama2_parse_output(result))

You get the following output:

alice_aa@anycompany.com
bob_bb@anycompany.com

Extract specific entities in a structured format

Using the previous sample report, you can extract more complex information in a structured manner. This time, you provide a JSON template for the model to use and return the output in JSON format.

With LLMs generating JSON documents as output, you can effortlessly parse them into a range of other data structures. This enables simple conversions to dictionaries, YAML, or even Pydantic models using third-party libraries, such as LangChain’s PydanticOutputParser. You can see the implementation in the GitHub repo.

import json

system = """
Your task is to precisely extract information from the text provided, and format it according to the given JSON schema delimited with triple backticks. Only include the JSON output in your response. If a specific field has no available data, indicate this by writing `null` as the value for that field in the output JSON. In cases where there is no data available at all, return an empty JSON object. Avoid including any other statements in the response.

```
{json_schema}
```
"""

json_schema = """
{
    "orders":
        [
            {
                "name": "<customer_name>",
                "location": "<customer_location>",
                "order_date": "<order_date in format YYYY-MM-DD>",
                "order_total": "<order_total>",
                "order_items": [
                    {
                        "item_name": "<item_name>",
                        "item_quantity": "<item_quantity>"
                    }
                ]
            }
        ]
}
"""


response = utils.llama2_chat(
    predictor,
    system=system.format(json_schema=json_schema),
    user=report_sample,
)
json_str = utils.llama2_parse_output(response)
print(json_str)

You get the following output:

{
    "orders": [
        {
            "name": "Alice",
            "location": "US",
            "order_date": "2023-11-05",
            "order_total": 2190,
            "order_items": [
                {
                    "item_name": null,
                    "item_quantity": null
                }
            ]
        },
        {
            "name": "Bob",
            "location": "UK",
            "order_date": "2023-11-07",
            "order_total": 1000,
            "order_items": [
                {
                    "item_name": "ergonomic keyboards",
                    "item_quantity": 25
                }
            ]
        },
        {
            "name": "Jane",
            "location": "Australia",
            "order_date": "2023-11-12",
            "order_total": 9000,
            "order_items": [
                {
                    "item_name": "high-definition monitors",
                    "item_quantity": 10
                }
            ]
        },
        {
            "name": "John",
            "location": "Singapore",
            "order_date": "2023-11-30",
            "order_total": 3600,
            "order_items": [
                {
                    "item_name": "USB-C docking stations",
                    "item_quantity": 15
                }
            ]
        }
    ]
}
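
Because the response is valid JSON, you can parse it directly into Python objects, for example with Pydantic (the model classes below are an illustrative schema, not part of the original notebook):

from typing import List, Optional

from pydantic import BaseModel

class OrderItem(BaseModel):
    item_name: Optional[str] = None
    item_quantity: Optional[int] = None

class Order(BaseModel):
    name: str
    location: str
    order_date: str
    order_total: float
    order_items: List[OrderItem] = []

class Orders(BaseModel):
    orders: List[Order]

orders = Orders.model_validate_json(json_str)  # Pydantic v2; use parse_raw on v1
print(orders.orders[1].name)  # "Bob"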

Classification using prompt engineering

LLMs can be a useful tool for information extraction tasks such as text classification. Common applications include classifying the intents of user interactions via channels such as email, chatbots, voice, and others, or categorizing documents to route their requests to downstream systems. The initial step involves identifying the intent or class of the user’s request or the document. These intents or classes could take many forms—from short single words to thousands of hierarchical classes and sub-classes.

In the following examples, we demonstrate prompt engineering on synthetic conversation data to extract intents. Additionally, we show how pre-trained models can be assessed to determine if fine-tuning is needed.

Let’s start with the following example. You have a list of customer interactions with an imaginary health and life insurance company. To start, use the Llama-2-70b-chat model you deployed in the previous section:

inference_instance_type = "ml.g5.48xlarge"

# Llama-2-70b chat
model_id, model_version = "meta-textgeneration-llama-2-70b-f", "2.*"
endpoint_name = model_id

predictor = utils.get_predictor(
    endpoint_name=endpoint_name,
    model_id=model_id,
    model_version=model_version,
    inference_instance_type=inference_instance_type,
)

The get_predictor function is a helper function that creates a predictor object from a model ID and version. If the specified endpoint doesn’t exist, it creates a new endpoint and deploys the model. If the endpoint already exists, it uses the existing endpoint.

customer_interactions = [
    """Hello, I've recently moved to a new state and I need to update my address for my health insurance policy.
Can you assist me with that?
""",
    """Good afternoon! I'm interested in adding dental coverage to my existing health plan.
Could you provide me the options and prices?
""",
    """I had a disappointing experience with the customer service yesterday regarding my claim.
I want to file a formal complaint and speak with a supervisor.
""",
]

system = """
Your task is to identify the customer intent from their interactions with support bot in the provided text. The intent output must not more than 4 words. If the intent is not clear, please provide a fallback intent of "unknown".
"""

def get_intent(system, customer_interactions):
    for customer_interaction in customer_interactions:
        response = utils.llama2_chat(
            predictor,
            system=system,
            user=customer_interaction,
        )
        content = utils.llama2_parse_output(response)
        print(content)
get_intent(system, customer_interactions)

You get the following output:

Update Address
Intent: Informational
Intent: Escalate issue

Looking at the output, these seem reasonable as the intents. However, the format and style of the intents can vary depending on the language model. Another limitation of this approach is that intents are not confined to a predefined list, which means the language model might generate and word the intents differently each time you run it.

To address this, you can use the in-context learning technique in prompt engineering to steer the model towards selecting from a predefined set of intents, or class labels, that you provide. In the following example, alongside the customer conversation, you include a list of potential intents and ask the model to choose from this list:

system = """
Your task is to identify the intent from the customer interaction with the support bot. Select from the intents provided in the following list delimited with ####. If the intent is not clear, please provide a fallback intent of "unknown". ONLY write the intent.

####
- information change
- add coverage
- complaint
- portal navigation
- free product upgrade
####
"""

get_intent(system, customer_interactions)

You get the following output:

information change
add coverage
complaint

Reviewing the results, it’s evident that the language model performs well in selecting the appropriate intent in the desired format.

Sub-intents and intent trees

If you make the preceding scenario more complex, as in many real-life use cases, intents can be designed in a large number of categories and also in a hierarchical fashion, which will make the classification tasks more challenging for the model. Therefore, you can further improve and modify your prompt to provide an example to the model, also known as n-shot learning, k-shot learning, or few-shot learning.

The following is the intent tree to use in this example. You can find its source code in the utils.py file in the code repository.

INTENTS = [
    {
        "main_intent": "profile_update",
        "sub_intents": [
            "contact_info",
            "payment_info",
            "members",
        ],
    },
    {
        "main_intent": "health_cover",
        "sub_intents": [
            "add_extras",
            "add_hospital",
            "remove_extras",
            "remove_hospital",
            "new_policy",
            "cancel_policy",
        ],
    },
    {
        "main_intent": "life_cover",
        "sub_intents": [
            "new_policy",
            "cancel_policy",
            "beneficiary_info",
        ],
    },
    {
        "main_intent": "customer_retention",
        "sub_intents": [
            "complaint",
            "escalation",
            "free_product_upgrade",
        ],
    },
    {
        "main_intent": "technical_support",
        "sub_intents": [
            "portal_navigation",
            "login_issues",
        ],
    },
]

Using the following prompt (which includes the intents), you can ask the model to pick from the provided list of intents:

system = """
Your task is to identify the intent from the customer interaction with the support bot. Identify the intent of the provided text using the list of provided intent tree delimited with ####. The intents are defined in classes and sub-classes. Write the intention with this format: <main-intent>:<sub-intent>. ONLY write the intent.

OUTPUT EXAMPLE:
profile_update:contact_info

OUTPUT EXAMPLE:
customer_retention:complaint

####
{intents}
####
"""

intents_json = json.dumps(utils.INTENTS, indent=4)
system = system.format(intents=intents_json)
get_intent(system, customer_interactions)

You get the following output:

profile_update:contact_info
health_cover:add_extras
customer_retention:complaint

Although LLMs can often correctly identify intent from a list of possible intents, they may sometimes produce additional outputs or fail to adhere to the exact intent structure and output schema. There are also scenarios where intents are not as straightforward as they initially seem or are highly specific to a business domain context that the model doesn’t fully comprehend.

As an example, in the following sample interaction, the customer ultimately wants to change their coverage, but their immediate question and interaction intent is to get help with portal navigation. Similarly, in the second interaction, the more appropriate intent is “free product upgrade” which the customer is requesting. However, the model is unable to detect these nuanced intents as accurately as desired.

customer_interactions = [
    "I want to change my coverage plan. But I'm not seeing where to do this on the online website. Could you please point me to it?",
    "I'm unhappy with the current benefits of my plan and I'm considering canceling unless there are better alternatives. What can you offer?",
]

get_intent(system, customer_interactions)

You get the following output:

profile_update:contact_info
customer_retention:complaint

Prompt engineering can often successfully extract specific intents from text. However, for some use cases, relying solely on prompt engineering has limitations. Scenarios where additional techniques beyond prompt engineering may be needed include:

  • Conversations with a large number of intent classes or long contexts that exceed the language model’s context window size, or that make queries more computationally expensive
  • Desired outputs in specific formats that the model struggles to adopt
  • Enhancing model understanding of the domain or task to boost performance

In the following section, we demonstrate how fine-tuning can boost the accuracy of the LLM for the intent classification task attempted earlier.

Fine-tuning an LLM for classification

The following sections detail the fine-tuning process of the FlanT5-XL and Mistral 7B models using SageMaker JumpStart. We use the FlanT5-XL and Mistral 7B models to compare their accuracy. Both models are significantly smaller than Llama-2-70b-chat. The goal is to determine whether smaller models can achieve state-of-the-art performance on specific tasks after they’re fine-tuned.

We have fine-tuned both the Mistral 7B and FlanT5-XL models. You can see the details of the Mistral 7B fine-tuning in the code repository. In the following, we outline the steps for fine-tuning and evaluating FlanT5-XL.

Initially, you deploy (or reuse) the FlanT5 endpoint as the base_predictor, which represents the base model prior to any fine-tuning. Subsequently, you assess the performance of the models by comparing them after the fine-tuning process.

inference_instance_type = "ml.g5.2xlarge"

model_id , model_version= "huggingface-text2text-flan-t5-xl", "2.0.0"
base_endpoint_name = model_id

base_predictor = utils.get_predictor(
    endpoint_name=base_endpoint_name,
    model_id=model_id,
    model_version=model_version,
    inference_instance_type=inference_instance_type,
)

Prepare training data for fine-tuning

Preparing for fine-tuning requires organizing several files, including the dataset and template files. The dataset is structured to align with the required input format for fine-tuning. For example, each record in our training dataset adheres to the following structure:

{"query": "customer query", "response": "main-intent:sub-intent"}

In this example, you use a synthesized dataset comprising customer interactions with a fictional insurance company. To learn more about the data and gain access to it, refer to the source code.

intent_dataset_file = "data/intent_dataset.jsonl"
intent_dataset_train_file = "data/intent_dataset_train.jsonl"
intent_dataset_test_file = "data/intent_dataset_test.jsonl"
ft_template_file = "data/template.json"

The following is the prompt for fine-tuning. The prompt has the query parameter, which is set during the fine-tuning using the SageMaker JumpStart SDK.

FT_PROMPT = """Identify the intent classes from the given user query, delimited with ####. Intents are categorized into two levels: main intent and sub intent. In your response, provide only ONE set of main and sub intents that is most relevant to the query. Write your response ONLY in this format <main-intent>:<sub-intent>. ONLY Write the intention.

OUTPUT EXAMPLE:
profile_update:contact_info

OUTPUT EXAMPLE:
technical_support:portal_navigation

#### QUERY:
{query}
####
"""

The following creates a template file that will be used by the SageMaker JumpStart framework to fine-tune the model. The template has two fields, prompt and completion. These fields are used to pass labeled data to the model for the fine-tuning process.

template = {
    "prompt": utils.FT_PROMPT,
    "completion": "{response}",
}

with open(ft_template_file, "w") as f:
    json.dump(template, f)

The training data is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket, setting the stage for the actual fine-tuning process.

train_data_location = utils.upload_train_and_template_to_s3(
    bucket_prefix="intent_dataset_flant5",
    train_path=intent_dataset_train_file,
    template_path=ft_template_file,
)

Fine-tune the model

Configure the JumpStartEstimator, specifying your chosen model and other parameters like instance type and hyperparameters (in this example, you use five epochs for the training). This estimator drives the fine-tuning process.

from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id=model_id,
    disable_output_compression=True,
    instance_type="ml.g5.24xlarge",
    role=utils.get_role_arn(),
)

estimator.set_hyperparameters(
    instruction_tuned="True", epochs="5", max_input_length="1024"
)

estimator.fit({"training": train_data_location})

Deploy the fine-tuned model

After fine-tuning, deploy the fine-tuned model:

finetuned_endpoint_name = "flan-t5-xl-ft-infoext"
finetuned_model_name = finetuned_endpoint_name
# Deploying the finetuned model to an endpoint
finetuned_predictor = estimator.deploy(
    endpoint_name=finetuned_endpoint_name,
    model_name=finetuned_model_name,
)

Use the following code to test the fine-tuned model against its base model with ambiguous queries, which you saw in the previous section:

ambiguous_queries = [
    {
        "query": "I want to change my coverage plan. But I'm not seeing where to do this on the online site. Could you please show me how?",
        "main_intent": "techincal_support",
        "sub_intent": "portal_navigation",
    },
    {
        "query": "I'm unhappy with the current benefits of my plan and I'm considering canceling unless there are better alternatives. What can you offer?",
        "main_intent": "customer_retention",
        "sub_intent": "free_product_upgrade",
    },
]
for query in ambiguous_queries:
    question = query["query"]
    print("query:", question, "n")
    print(
        "expected intent:  ", f"{query['main_intent']}:{query['sub_intent']}"
    )

    prompt = utils.FT_PROMPT.format(query=question)
    response = utils.flant5(base_predictor, user=prompt, max_tokens=13)
    print("base model:  ", utils.parse_output(response))

    response = utils.flant5(finetuned_predictor, user=prompt, max_tokens=13)
    print("finetuned model:  ", utils.parse_output(response))
    print("-" * 80)

You get the following output:

query: I want to change my coverage plan. But I'm not seeing where to do this on the online site. Could you please show me how?
expected intent:   techincal_support:portal_navigation
base model:   main_intent>:sub_intent> change
finetuned model:   technical_support:portal_navigation
--------------------------------------------------------------------------------
query: I'm unhappy with the current benefits of my plan and I'm considering canceling unless there are better alternatives. What can you offer?

expected intent:   customer_retention:free_product_upgrade
base model:   main_intent>:sub_intent> cancel
finetuned model:   customer_retention:free_product_upgrade
--------------------------------------------------------------------------------

As shown in this example, the fine-tuned model is able to classify the ambiguous queries correctly.

In evaluations, fine-tuned models performed better in identifying the correct class for both clear and ambiguous intents. The following section details the benchmark’s performance overall, and against each intent.

Performance comparisons and considerations

In this section, we have gathered the evaluation results and performance benchmarks for each model, before and after fine-tuning, as well as a comparison between the prompt engineering and fine-tuning the LLM. The dataset consists of 7,824 examples, with a 90% split for training (including validation) and 10% for testing.
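
The following is a minimal sketch of how the overall and per-class accuracy could be computed on the held-out test split (the prediction callable is illustrative):

import json
from collections import defaultdict

def evaluate(test_file, predict_fn):
    """Compare predicted 'main-intent:sub-intent' labels against ground truth."""
    correct, total = 0, 0
    per_class = defaultdict(lambda: [0, 0])  # intent -> [correct, total]

    with open(test_file) as f:
        for line in f:
            record = json.loads(line)
            expected = record["response"].strip()
            predicted = predict_fn(record["query"]).strip()

            total += 1
            per_class[expected][1] += 1
            if predicted == expected:
                correct += 1
                per_class[expected][0] += 1

    print(f"Overall accuracy: {correct / total:.2%}")
    for intent, (hits, count) in sorted(per_class.items()):
        print(f"{intent}: {hits / count:.2%} ({count} examples)")

# Example usage with the fine-tuned FlanT5-XL predictor:
# evaluate(
#     intent_dataset_test_file,
#     lambda q: utils.parse_output(
#         utils.flant5(finetuned_predictor, user=utils.FT_PROMPT.format(query=q), max_tokens=13)
#     ),
# )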

Model | Overall Accuracy | Fine-tuning Duration (minutes) | Notes
Mistral-7b (fine-tuned five epochs, without classes in the prompt) | 98.97% | 720 | Given Mistral-7b’s nature as a text generation model, parsing its output to extract intent can be challenging due to tendencies for character repetition and generation of additional characters. Improved performance with more epochs: 98% accuracy for five epochs compared to 92% for one epoch.
Flan-T5-XL (fine-tuned five epochs, without classes in the prompt) | 98.46% | 150 | Marginal improvement in accuracy with increased epochs: from 97.5% (one epoch) to 98.46% (five epochs).
Llama-2-70b-chat (with classes in the prompt) | 78.42% | N/A | Low accuracy in ambiguous scenarios.
Llama-2-70b-chat (without classes in the prompt) | 10.85% | N/A |
Flan-T5-XL (base model, without classes in the prompt) | 0.0% | N/A | Unable to identify any of the intent classes with the expected format.
Mistral-7b (base model, without classes in the prompt) | 0.0% | N/A | Unable to identify any of the intent classes with the expected format.

The following table contains a breakdown of models’ accuracy for each intent class.

Main Intent | Sub-intent | Example Count | Llama2-70b (without classes in prompt) | Llama2-70b (with classes in prompt) | Flan-T5-XL Fine-tuned | Mistral-7b Fine-tuned
Customer Retention | Complaint | 63 | 7.94% | 44.44% | 98.41% | 98.41%
Customer Retention | Escalation | 49 | 91.84% | 100% | 100% | 100%
Customer Retention | Free Product Upgrade | 50 | 0.00% | 64.00% | 100% | 100%
Health Cover | Add Extras | 38 | 0.00% | 100% | 97.37% | 100%
Health Cover | Add Hospital | 44 | 0.00% | 81.82% | 100% | 97.73%
Health Cover | Cancel Policy | 43 | 0.00% | 100% | 100% | 97.67%
Health Cover | New Policy | 41 | 0.00% | 82.93% | 100% | 100%
Health Cover | Remove Extras | 47 | 0.00% | 85.11% | 100% | 100%
Health Cover | Remove Hospital | 53 | 0.00% | 84.90% | 100% | 100%
Life Cover | Beneficiary Info | 45 | 0.00% | 100% | 97.78% | 97.78%
Life Cover | Cancel Policy | 47 | 0.00% | 55.32% | 100% | 100%
Life Cover | New Policy | 40 | 0.00% | 90.00% | 92.50% | 100%
Profile Update | Contact Info | 45 | 35.56% | 95.56% | 95.56% | 95.56%
Profile Update | Members | 52 | 0.00% | 36.54% | 98.08% | 98.08%
Profile Update | Payment Info | 47 | 40.43% | 97.87% | 100% | 100%
Technical Support | Login Issues | 39 | 0.00% | 92.31% | 97.44% | 100%
Technical Support | Portal Navigation | 40 | 0.00% | 45.00% | 95.00% | 97.50%

This comparative analysis illustrates the trade-offs between fine-tuning time and model accuracy. It highlights the ability of models like Mistral-7b and FlanT5-XL to achieve higher classification accuracy through fine-tuning. Additionally, it shows how smaller models can match or surpass the performance of larger models on specific tasks when fine-tuned, contrasted with using prompt engineering alone on the larger models.

Clean up

Complete the following steps to clean up your resources:

  1. Delete the SageMaker endpoints, configuration, and models.
  2. Delete the S3 bucket created for this example.
  3. Delete the SageMaker notebook instance (if you used one to run this example).

Summary

Large language models have revolutionized information extraction from unstructured text data. These models excel in tasks such as classifying information and extracting key entities from various documents, achieving state-of-the-art results with minimal data.

This post demonstrated the use of large language models for information extraction through prompt engineering and fine-tuning. While effective, relying solely on prompt engineering can have limitations for complex tasks that require rigid output formats or a large number of classes. In these scenarios, fine-tuning even smaller models on domain-specific data can significantly improve performance beyond what prompt engineering alone can achieve.

The post included practical examples highlighting how fine-tuned smaller models can surpass prompt engineering with larger models for such complex use cases. Although prompt engineering is a good starting point for simpler use cases, fine-tuning offers a more robust solution for complex information extraction tasks, ensuring higher accuracy and adaptability to specific use cases. SageMaker JumpStart tools and services facilitate this process, making it accessible for individuals and teams across all levels of ML expertise.

Additional reading

You can read more on using SageMaker JumpStart for intelligent document processing, fine-tuning, and evaluation of LLMs in the Amazon SageMaker JumpStart documentation and on the AWS Machine Learning Blog.


About the Authors

Pooya Vahidi  is a Senior Solutions Architect at AWS, passionate about computer science, artificial intelligence, and cloud computing. As an AI professional, he is an active member of the AWS AI/ML Area-of-Depth team. With a background spanning over two decades of expertise in leading the architecture and engineering of large-scale solutions, he helps customers on their transformative journeys through cloud and AI/ML technologies.

Dr. Romina Sharifpour is a Senior Machine Learning and Artificial Intelligence Solutions Architect at Amazon Web Services (AWS). She has spent over 10 years leading the design and implementation of innovative end-to-end solutions enabled by advancements in ML and AI. Romina’s areas of interest are natural language processing, large language models, and MLOps.

Read More

AWS Inferentia and AWS Trainium deliver lowest cost to deploy Llama 3 models in Amazon SageMaker JumpStart

AWS Inferentia and AWS Trainium deliver lowest cost to deploy Llama 3 models in Amazon SageMaker JumpStart

Today, we’re excited to announce the availability of Meta Llama 3 inference on AWS Trainium and AWS Inferentia based instances in Amazon SageMaker JumpStart. The Meta Llama 3 models are a collection of pre-trained and fine-tuned generative text models. Amazon Elastic Compute Cloud (Amazon EC2) Trn1 and Inf2 instances, powered by AWS Trainium and AWS Inferentia2, provide the most cost-effective way to deploy Llama 3 models on AWS. They offer up to 50% lower cost to deploy than comparable Amazon EC2 instances. They not only reduce the time and expense involved in training and deploying large language models (LLMs), but also provide developers with easier access to high-performance accelerators to meet the scalability and efficiency needs of real-time applications, such as chatbots and AI assistants.

In this post, we demonstrate how easy it is to deploy Llama 3 on AWS Trainium and AWS Inferentia based instances in SageMaker JumpStart.

Meta Llama 3 model on SageMaker Studio

SageMaker JumpStart provides access to publicly available and proprietary foundation models (FMs). Foundation models are onboarded and maintained from third-party and proprietary providers. As such, they are released under different licenses as designated by the model source. Be sure to review the license for any FM that you use. You are responsible for reviewing and complying with applicable license terms and making sure they are acceptable for your use case before downloading or using the content.

You can access the Meta Llama 3 FMs through SageMaker JumpStart on the Amazon SageMaker Studio console and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all machine learning (ML) development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Get Started with SageMaker Studio.

On the SageMaker Studio console, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane. If you’re using SageMaker Studio Classic, refer to Open and use JumpStart in Studio Classic to navigate to the SageMaker JumpStart models.

From the SageMaker JumpStart landing page, you can search for “Meta” in the search box.

Choose the Meta model card to list all the models from Meta on SageMaker JumpStart.

You can also find relevant model variants by searching for “neuron.” If you don’t see Meta Llama 3 models, update your SageMaker Studio version by shutting down and restarting SageMaker Studio.

No-code deployment of the Llama 3 Neuron model on SageMaker JumpStart

You can choose the model card to view details about the model, such as the license, data used to train, and how to use it. You can also find two buttons, Deploy and Preview notebooks, which help you deploy the model.

When you choose Deploy, the page shown in the following screenshot appears. The top section of the page shows the end-user license agreement (EULA) and acceptable use policy for you to acknowledge.

After you acknowledge the policies, provide your endpoint settings and choose Deploy to deploy the endpoint of the model.

Alternatively, you can deploy through the example notebook by choosing Open Notebook. The example notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

Meta Llama 3 deployment on AWS Trainium and AWS Inferentia using the SageMaker JumpStart SDK

In SageMaker JumpStart, we have pre-compiled the Meta Llama 3 model for a variety of configurations to avoid runtime compilation during deployment and fine-tuning. The Neuron Compiler FAQ has more details about the compilation process.

There are two ways to deploy Meta Llama 3 on AWS Inferentia and Trainium based instances using the SageMaker JumpStart SDK. You can deploy the model with two lines of code for simplicity, or focus on having more control of the deployment configurations. The following code snippet shows the simpler mode of deployment:

from sagemaker.jumpstart.model import JumpStartModel

model_id = "meta-textgenerationneuron-llama-3-8b"
accept_eula = True
model = JumpStartModel(model_id=model_id)
predictor = model.deploy(accept_eula=accept_eula)  # accept_eula must be set to True to deploy the model

To perform inference on these models, you need to specify the argument accept_eula as True as part of the model.deploy() call. This means you have read and accepted the EULA of the model. The EULA can be found in the model card description or from https://ai.meta.com/resources/models-and-libraries/llama-downloads/.

The default instance type for Meta Llama-3-8B is ml.inf2.24xlarge. The other supported model IDs for deployment are the following:

  • meta-textgenerationneuron-llama-3-70b
  • meta-textgenerationneuron-llama-3-8b-instruct
  • meta-textgenerationneuron-llama-3-70b-instruct

SageMaker JumpStart has pre-selected configurations that can help get you started, which are listed in the following table. For more information about optimizing these configurations further, refer to advanced deployment configurations.

Llama-3 8B and Llama-3 8B Instruct

| Instance type | OPTION_N_POSITIONS | OPTION_MAX_ROLLING_BATCH_SIZE | OPTION_TENSOR_PARALLEL_DEGREE | OPTION_DTYPE |
| ml.inf2.8xlarge | 8192 | 1 | 2 | bf16 |
| ml.inf2.24xlarge (Default) | 8192 | 1 | 12 | bf16 |
| ml.inf2.24xlarge | 8192 | 12 | 12 | bf16 |
| ml.inf2.48xlarge | 8192 | 1 | 24 | bf16 |
| ml.inf2.48xlarge | 8192 | 12 | 24 | bf16 |

Llama-3 70B and Llama-3 70B Instruct

| Instance type | OPTION_N_POSITIONS | OPTION_MAX_ROLLING_BATCH_SIZE | OPTION_TENSOR_PARALLEL_DEGREE | OPTION_DTYPE |
| ml.trn1.32xlarge | 8192 | 1 | 32 | bf16 |
| ml.trn1.32xlarge (Default) | 8192 | 4 | 32 | bf16 |

The following code shows how you can customize deployment configurations such as sequence length, tensor parallel degree, and maximum rolling batch size:

from sagemaker.jumpstart.model import JumpStartModel

model_id = "meta-textgenerationneuron-llama-3-70b"
model = JumpStartModel(
    model_id=model_id,
    env={
        "OPTION_DTYPE": "bf16",
        "OPTION_N_POSITIONS": "8192",
        "OPTION_TENSOR_PARALLEL_DEGREE": "32",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "4", 
    },
    instance_type="ml.trn1.32xlarge"  
)
## Change 'accept_eula' to True to accept the end-user license agreement and deploy
pretrained_predictor = model.deploy(accept_eula=False)

Now that you have deployed the Meta Llama 3 neuron model, you can run inference from it by invoking the endpoint:

payload = {
    "inputs": "I believe the meaning of life is",
    "parameters": {
        "max_new_tokens": 64,
        "top_p": 0.9,
        "temperature": 0.6,
    },
}

response = pretrained_predictor.predict(payload)

Output: 

I believe the meaning of life is
>  to be happy. I believe that happiness is a choice. I believe that happiness 
is a state of mind. I believe that happiness is a state of being. I believe that 
happiness is a state of being. I believe that happiness is a state of being. I 
believe that happiness is a state of being. I believe

For more information on the parameters in the payload, refer to Detailed parameters.

Refer to Fine-tune and deploy Llama 2 models cost-effectively in Amazon SageMaker JumpStart with AWS Inferentia and AWS Trainium for details on how to pass the parameters to control text generation.

Clean up

After you have finished experimenting with the endpoint and no longer need the resources, you can delete them using the following code:

# Delete resources
# Delete the deployed model
pretrained_predictor.delete_model()

# Delete the model endpoint
pretrained_predictor.delete_endpoint()

Conclusion

The deployment of Meta Llama 3 models on AWS Inferentia and AWS Trainium using SageMaker JumpStart offers a low-cost option for deploying large-scale generative AI models like Llama 3 on AWS. These models, including variants like Meta-Llama-3-8B, Meta-Llama-3-8B-Instruct, Meta-Llama-3-70B, and Meta-Llama-3-70B-Instruct, use AWS Neuron for inference on AWS Trainium and Inferentia. AWS Trainium and Inferentia offer up to 50% lower cost to deploy than comparable EC2 instances.

In this post, we demonstrated how to deploy Meta Llama 3 models on AWS Trainium and AWS Inferentia using SageMaker JumpStart. The ability to deploy these models through the SageMaker JumpStart console and Python SDK offers flexibility and ease of use. We are excited to see how you use these models to build interesting generative AI applications.

To start using SageMaker JumpStart, refer to Getting started with Amazon SageMaker JumpStart. For more examples of deploying models on AWS Trainium and AWS Inferentia, see the GitHub repo. For more information on deploying Meta Llama 3 models on GPU-based instances, see Meta Llama 3 models are now available in Amazon SageMaker JumpStart.


About the Authors

Xin Huang is a Senior Applied Scientist
Rachna Chadha is a Principal Solutions Architect – AI/ML
Qing Lan is a Senior SDE – ML System
Pinak Panigrahi is a Senior Solutions Architect Annapurna ML
Christopher Whitten is a Software Development Engineer
Kamran Khan is a Head of BD/GTM Annapurna ML
Ashish Khetan is a Senior Applied Scientist
Pradeep Cruz is a Senior SDM

Read More

Revolutionize Customer Satisfaction with tailored reward models for your business on Amazon SageMaker

Revolutionize Customer Satisfaction with tailored reward models for your business on Amazon SageMaker

As more powerful large language models (LLMs) are used to perform a variety of tasks with greater accuracy, the number of applications and services that are being built with generative artificial intelligence (AI) is also growing. With great power comes responsibility, and organizations want to make sure that these LLMs produce responses that align with their organizational values and provide the same unique experience they always intended for their end-customers.

Evaluating AI-generated responses presents challenges. This post discusses techniques to align them with company values and build a custom reward model using Amazon SageMaker. By doing so, you can provide customized customer experiences that uniquely reflect your organization’s brand identity and ethos.

Challenges with out-of-the-box LLMs

Out-of-the-box LLMs provide high accuracy, but often lack customization for an organization’s specific needs and end-users. Human feedback varies in subjectivity across organizations and customer segments. Collecting diverse, subjective human feedback to refine LLMs is time-consuming and unscalable.

This post showcases a reward modeling technique to efficiently customize LLMs for an organization by programmatically defining rewards functions that capture preferences for model behavior. We demonstrate an approach to deliver LLM results tailored to an organization without intensive, continual human judgement. The techniques aim to overcome customization and scalability challenges by encoding an organization’s subjective quality standards into a reward model that guides the LLM to generate preferable outputs.

Objective vs. subjective human feedback

Not all human feedback is the same. We can categorize human feedback into two types: objective and subjective.

Any human being who is asked to judge the color of the following boxes would confirm that the left one is a white box and the right one is a black box. This is objective, and there are no changes to it whatsoever.

Determining whether an AI model’s output is “great” is inherently subjective. Consider the following color spectrum. If asked to describe the colors on the ends, people would provide varied, subjective responses based on their perceptions. One person’s white may be another’s gray.

This subjectivity poses a challenge for improving AI through human feedback. Unlike objective right/wrong feedback, subjective preferences are nuanced and personalized. The same output could elicit praise from one person and criticism from another. The key is acknowledging and accounting for the fundamental subjectivity of human preferences in AI training. Rather than seeking elusive objective truths, we must provide models exposure to the colorful diversity of human subjective judgment.

Unlike traditional model tasks such as classification, which can be neatly benchmarked on test datasets, assessing the quality of a sprawling conversational agent is highly subjective. One human’s riveting prose is another’s aimless drivel. So how should we refine these expansive language models when humans intrinsically disagree on the hallmarks of a “good” response?

The key is gathering feedback from a diverse crowd. With enough subjective viewpoints, patterns emerge on engaging discourse, logical coherence, and harmless content. Models can then be tuned based on broader human preferences. There is a general perception that reward models are often associated only with Reinforcement Learning from Human Feedback (RLHF). Reward modeling, in fact, goes beyond RLHF, and can be a powerful tool for aligning AI-generated responses with an organization’s specific values and brand identity.

Reward modeling

You can choose an LLM and have it generate numerous responses to diverse prompts, and then your human labelers will rank those responses. It’s important to have diversity in human labelers. Clear labeling guidelines are critical; without explicit criteria, judgments can become arbitrary. Useful dimensions include coherence, relevance, creativity, factual correctness, logical consistency, and more. Human labelers put these responses into categories and rank them from most favorite to least favorite, as shown in the following example. Each labeler assigns 1 to their most preferred response and 3 to their least preferred, so each column captures how one labeler ranked the candidate LLM responses.
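To make this concrete, the following minimal sketch uses hypothetical labeler names and rankings to show one way such preference data could be collected and aggregated before training a reward model:

import pandas as pd

# Hypothetical rankings: three labelers each rank three candidate LLM responses
# to the same prompt (1 = most preferred, 3 = least preferred).
rankings = pd.DataFrame(
    {
        "labeler": ["labeler_1", "labeler_2", "labeler_3"],
        "response_a": [1, 2, 1],
        "response_b": [2, 1, 3],
        "response_c": [3, 3, 2],
    }
)

# A lower average rank means the crowd preferred that response; these aggregates
# become the preference signal used to train the reward model.
print(rankings[["response_a", "response_b", "response_c"]].mean())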

By compiling these subjective ratings, patterns emerge on what resonates across readers. The aggregated human feedback essentially trains a separate reward model on writing qualities that appeal to people. This technique of distilling crowd perspectives into an AI reward function is called reward modeling. It provides a method to improve LLM output quality based on diverse subjective viewpoints.

Solution overview

In this post, we detail how to train a reward model based on organization-specific human labeling feedback collected for various prompts tested on the base FM. The following diagram illustrates the solution architecture.

For more details, see the accompanying notebook.

Prerequisites

To successfully train a reward model, you need the following:

Launch SageMaker Studio

Complete the following steps to launch SageMaker Studio:

  1. On the SageMaker console, choose Studio in the navigation pane.
  2. On the Studio landing page, select the domain and user profile for launching Studio.
  3. Choose Open Studio.
  4. To launch SageMaker Studio, choose Launch personal Studio.

Let’s see how to create a reward model locally in a SageMaker Studio notebook environment by using a pre-existing model from the Hugging Face model hub.

Prepare a human-labeled dataset and train a reward model

When doing reward modeling, getting feedback data from humans can be expensive. This is because reward modeling needs feedback from other human workers instead of only using data collected during regular system use. How well your reward model behaves depends on the quality and amount of feedback from humans.

We recommend using AWS-managed offerings such as Amazon SageMaker Ground Truth. It offers the most comprehensive set of human-in-the-loop capabilities, allowing you to harness the power of human feedback across the machine learning (ML) lifecycle to improve the accuracy and relevancy of models. You can complete a variety of human-in-the-loop tasks with SageMaker Ground Truth, from data generation and annotation to model review, customization, and evaluation, either through a self-service or AWS-managed offering.

For this post, we use the IMDB dataset to train a reward model that provides a higher score for text that humans have labeled as positive, and a lower score for negative text.

We prepare the dataset with the following code:

def create_custom_dataset(raw_dataset):
    df = raw_dataset.to_pandas()
    negative_df = df[df['label']==0]
    positive_df = df[df['label']==1]
    negative_df = negative_df.drop(
        columns=['label']).rename(
        columns={'text': 'rejected'})
    # shuffle the data
    positive_df = positive_df.sample(
        frac=1, random_state=0).reset_index(
        drop=True).drop(columns=['label']).rename(
        columns={'text': 'chosen'})
    joined_df = negative_df.join(positive_df)

    def tokenize_fn(texts, max_length=args.seq_length):
        encoded = tokenizer(
            texts,
            padding='max_length',
            max_length=max_length,
            truncation=True,
            add_special_tokens=False,
        )
        return encoded

    rejected_encoded = tokenize_fn(joined_df.rejected.values.tolist())
    joined_df['rejected_input_ids'] = rejected_encoded['input_ids']
    joined_df['rejected_attention_mask'] = rejected_encoded['attention_mask']
    encoded_chosen = tokenize_fn(joined_df.chosen.values.tolist())
    joined_df['chosen_input_ids'] = encoded_chosen['input_ids']
    joined_df['chosen_attention_mask'] = encoded_chosen['attention_mask']
    
    train_dataset = Dataset.from_pandas(joined_df, preserve_index=False)
    
    return train_dataset.with_format("torch")

The following example shows a sample record from the prepared dataset, which includes references to rejected and chosen responses. We have also embedded the input ID and attention mask for the chosen and rejected responses.

{'rejected': "If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story.<br /><br />One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives (unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film).<br /><br />One might better spend one's time staring out a window at a tree growing.<br /><br />",
 'chosen': "This is a great movie. I love it more each time i watch. Most comedies can get pretty lame because you know all the gags, but mystery men has so much integrity in the writing and characterization that watching once again -- as Ben Stiller tears at the hood ornament of the limo, or Hank Azaria says good-bye to Louise Lasser, or Geoffrey Rush flashes his fuhrer choreography, or Tom Waits mumbles while he watches the news report, or Janeane Garofalo refuses a kiss from Paul Reubens -- is a pleasure. This is pitch perfect ensemble acting. The story develops directly and consistently, the action sequences are creative and not too dominant, all the set-ups payoff by the end. Seriously, if you've seen it and it's been a while, watch it again, and if you haven't then get started. You can't watch it again until you've seen it the first time. (Wes Studi, William H. Macy, the tryouts scene. Too much good stuff!)",
 'rejected_input_ids': tensor([1106,  129,    7,  ...,    1,    1,    1]),
 'rejected_attention_mask': tensor([1, 1, 1,  ..., 0, 0, 0]),
 'chosen_input_ids': tensor([713,  16,  10,  ...,   1,   1,   1]),
 'chosen_attention_mask': tensor([1, 1, 1,  ..., 0, 0, 0])}

Load the pre-trained model

In this case, we use the OPT-1.3b (Open Pre-trained Transformer Language Model) model from the Hugging Face model hub. If you want to do all of the training locally on your notebook instead of using distributed training, you need an instance with enough accelerator memory. We run the following training on a notebook using the ml.g4dn.xlarge instance type:

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    set_seed,
)
from datasets import Dataset, load_dataset
import torch

model = AutoModelForSequenceClassification.from_pretrained(
    'facebook/opt-1.3b',
    torch_dtype=torch.bfloat16,
    device_map="auto",
    num_labels=1,
)

Define the custom trainer function

In the following code snippet, we create a custom trainer that calculates how well a model is performing on a task:

from torch import nn 
from transformers import Trainer 
import torch.nn.functional as F 

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        chosen_input_ids = inputs['chosen_input_ids']
        chosen_attention_mask = inputs['chosen_attention_mask']
        rejected_input_ids = inputs['rejected_input_ids']
        rejected_attention_mask = inputs['rejected_attention_mask']
        # Reward scores for the chosen (preferred) and rejected responses
        r_w = model(chosen_input_ids, chosen_attention_mask).logits
        r_l = model(rejected_input_ids, rejected_attention_mask).logits
        outputs = (r_w, r_l)
        # Pairwise ranking loss: push the chosen score above the rejected score
        loss = -F.logsigmoid(r_w - r_l).mean()
        return (loss, outputs) if return_outputs else loss

It compares the model’s results for two sets of input data: one set that was chosen and another set that was rejected. The trainer then uses these results to figure out how good the model is at distinguishing between the chosen and rejected data. This helps the trainer adjust the model to improve its performance on the task. The CustomTrainer class is used to create a specialized trainer that calculates the loss function for a specific task involving chosen and rejected input sequences. This custom trainer extends the functionality of the standard Trainer class provided by the transformers library, allowing for a tailored approach to handling model outputs and loss computation based on the specific requirements of the task. See the following code:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="reward_model",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=False,
    do_predict=False,
    evaluation_strategy="no",
    learning_rate=5e-5,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=32,
    remove_unused_columns=False,
)

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model()

The TrainingArguments in the provided code snippet are used to configure various aspects of the training process for an ML model. Let’s break down the purpose of each parameter, and how they can influence the training outcome:

  • output_dir – Specifies the directory where the trained model and associated files will be saved. This parameter helps organize and store the trained model for future use.
  • overwrite_output_dir – Determines whether to overwrite the output directory if it already exists. Setting this to True allows for reusing the same directory without manual deletion.
  • do_train – Indicates whether to perform training. If set to True, the model will be trained using the provided training dataset.
  • do_eval and do_predict – Control whether to perform evaluation and prediction tasks, respectively. In this case, both are set to False, meaning only training will be conducted.
  • evaluation_strategy – Defines when evaluation should be performed during training. Setting it to “no” means evaluation will not be done during training.
  • learning_rate – Specifies the learning rate for the optimizer, influencing how quickly or slowly the model learns from the data.
  • num_train_epochs – Sets the number of times the model will go through the entire training dataset during training. One epoch means one complete pass through all training samples.
  • per_device_train_batch_size – Determines how many samples are processed in each batch during training on each device (for example, GPU). A smaller batch size can lead to slower but more stable training.
  • gradient_accumulation_steps – Controls how often gradients are accumulated before updating the model’s parameters. This can help stabilize training with large batch sizes.
  • remove_unused_columns – Specifies whether unused columns in the dataset should be removed before processing, optimizing memory usage.

By configuring these parameters in the TrainingArguments, you can influence various aspects of the training process, such as model performance, convergence speed, memory usage, and overall training outcome based on your specific requirements and constraints.

When you run this code, it trains the reward model based on the numerical representation of subjective feedback you gathered from the human labelers. A trained reward model will give a higher score to LLM responses that humans are more likely to prefer.

Use the reward model to evaluate the base LLM

You can now feed the response from your LLM to this reward model, and the numerical score it produces tells you how well the LLM response aligns with the subjective organizational preferences embedded in the reward model. The following diagram illustrates this process. You can use this number as a threshold for deciding whether the response from the LLM can be shared with the end-user.

For example, let’s say we created a reward model to avoid toxic, harmful, or inappropriate content. If a chatbot powered by an LLM produces a response, the reward model can score the chatbot’s responses. Responses with scores above a predetermined threshold are deemed acceptable to share with users; scores below the threshold mean the content should be blocked. This lets us automatically filter chatbot content that doesn’t meet the standards we want to enforce. To explore more, see the accompanying notebook.
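As an illustration of this thresholding step, the following minimal sketch loads the reward model saved earlier by trainer.save_model() and scores a candidate response; the threshold value and helper function are hypothetical and would need to be calibrated on your own labeled data.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the reward model saved earlier and the matching tokenizer
reward_model = AutoModelForSequenceClassification.from_pretrained("reward_model")
reward_tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

SCORE_THRESHOLD = 0.0  # hypothetical cutoff; calibrate on held-out labeled responses

def is_acceptable(response_text: str) -> bool:
    # Higher reward scores indicate responses humans are more likely to prefer
    inputs = reward_tokenizer(
        response_text, return_tensors="pt", truncation=True, max_length=512
    )
    with torch.no_grad():
        score = reward_model(**inputs).logits.item()
    return score >= SCORE_THRESHOLD

print(is_acceptable("Thanks for reaching out! Here is how to reset your password safely."))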

Clean up

To avoid incurring future charges, delete all the resources that you created. Delete the deployed SageMaker models, if any, and stop the SageMaker Studio notebook you launched for this exercise.

Conclusion

In this post, we showed how to train a reward model that predicts a human preference score from the LLM’s response. This is done by generating several outputs for each prompt with the LLM, then asking human annotators to rank or score the responses to each prompt. The reward model is then trained to predict the human preference score from the LLM’s response. After the reward model is trained, you can use the reward model to evaluate the LLM’s responses against your subjective organizational standards.

As an organization evolves, the reward functions must evolve alongside changing organizational values and user expectations. What defines a “great” AI output is subjective and transforming. Organizations need flexible ML pipelines that continually retrain reward models with updated rewards reflecting latest priorities and needs. This space is continuously evolving: direct preference-based policy optimization, tool-augmented reward modeling, and example-based control are other popular alternative techniques to align AI systems with human values and goals.

We invite you to take the next step in customizing your AI solutions by engaging with the diverse and subjective perspectives of human feedback. Embrace the power of reward modeling to ensure your AI systems resonate with your brand identity and deliver the exceptional experiences your customers deserve. Start refining your AI models today with Amazon SageMaker and join the vanguard of businesses setting new standards in personalized customer interactions. If you have any questions or feedback, please leave them in the comments section.


About the Author

Dinesh Kumar Subramani is a Senior Solutions Architect based in Edinburgh, Scotland. He specializes in artificial intelligence and machine learning, and is a member of the technical field community within Amazon. Dinesh works closely with UK Central Government customers to solve their problems using AWS services. Outside of work, Dinesh enjoys spending quality time with his family, playing chess, and exploring a diverse range of music.

Read More

Amazon Personalize launches new recipes supporting larger item catalogs with lower latency

Amazon Personalize launches new recipes supporting larger item catalogs with lower latency

Personalized customer experiences are essential for engaging today’s users. However, delivering truly personalized experiences that adapt to changes in user behavior can be both challenging and time-consuming. Amazon Personalize makes it straightforward to personalize your website, app, emails, and more, using the same machine learning (ML) technology used by Amazon, without requiring ML expertise. With the recipes—algorithms for specific use cases—provided by Amazon Personalize, you can deliver a wide array of personalization, including product or content recommendations and personalized ranking.

Today, we are excited to announce the general availability of two advanced recipes in Amazon Personalize, User-Personalization-v2 and Personalized-Ranking-v2 (v2 recipes), which are built on the cutting-edge Transformers architecture to support larger item catalogs with lower latency.

In this post, we summarize the new enhancements, and guide you through the process of training a model and providing recommendations for your users.

Benefits of new recipes

The new recipes offer enhancements in scalability, latency, model performance, and functionality.

  • Enhanced scalability – The new recipes now support training on catalogs with up to 5 million items and up to 3 billion interactions, empowering personalization for large catalogs and platforms with billions of usage events.
  • Lower latency – The lower inference latency and faster training times for large datasets of these new recipes can reduce the delay for your end-users.
  • Performance optimization – Amazon Personalize testing showed that v2 recipes improved recommendation accuracy by up to 9% and recommendation coverage by up to 1.8x compared to previous versions. A higher coverage means Amazon Personalize recommends more of your catalog.
  • Return item metadata in inference responses – The new recipes enable item metadata by default without extra charge, allowing you to return metadata such as genres, descriptions, and availability in inference responses. This can help you enrich recommendations in your user interfaces without extra work. If you use Amazon Personalize with generative AI, you can also feed the metadata into prompts. Providing more context to large language models can help them gain a deeper understanding of product attributes to generate more relevant content.
  • Highly automated operations – Our new recipes are designed to reduce your overhead for training and tuning the model. For example, Amazon Personalize simplifies training configuration and automatically selects the optimal settings for your custom models behind the scenes.

Solution overview

To use the User-Personalization-v2 and Personalized-Ranking-v2 recipes, you first need to set up Amazon Personalize resources. Create your dataset group, import your data, train a solution version, and deploy a campaign. For full instructions, see Getting started.

For this post, we follow the Amazon Personalize console approach to deploy a campaign. Alternatively, you can build the entire solution using the SDK approach. You can also get batch recommendations with an asynchronous batch flow. We use the MovieLens public dataset and User-Personalization-v2 recipe to show you the workflow.
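If you prefer the SDK approach, the following minimal sketch shows how the solution, solution version, and campaign could be created with the AWS SDK for Python (Boto3); the resource names and ARNs are placeholders, and it assumes your dataset group and interactions dataset already exist.

import boto3

personalize = boto3.client("personalize")

# Create a solution that uses the new User-Personalization-v2 recipe
create_solution_response = personalize.create_solution(
    name="movielens-user-personalization-v2",  # hypothetical name
    datasetGroupArn="arn:aws:personalize:us-east-1:111122223333:dataset-group/movielens",  # placeholder
    recipeArn="arn:aws:personalize:::recipe/aws-user-personalization-v2",
)

# Train a solution version (the trained model) for the solution
create_version_response = personalize.create_solution_version(
    solutionArn=create_solution_response["solutionArn"]
)

# After the solution version reaches ACTIVE status, deploy it behind a campaign
create_campaign_response = personalize.create_campaign(
    name="movielens-v2-campaign",  # hypothetical name
    solutionVersionArn=create_version_response["solutionVersionArn"],
    minProvisionedTPS=1,
)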

Prepare the dataset

Complete the following steps to prepare your dataset:

  1. Create a dataset group. Each dataset group can contain up to three datasets: users, items, and interactions, with the interactions dataset being mandatory for User-Personalization-v2 and Personalized-Ranking-v2.
  2. Create an interactions dataset using a schema.
  3. Import the interactions data to Amazon Personalize from Amazon Simple Storage Service (Amazon S3).

Train a model

After the dataset import job is complete, you can analyze your data before training. The Amazon Personalize data analysis feature shows you statistics about your data, as well as actions you can take to meet training requirements and improve recommendations.

Now you’re ready to train your model.

  1. On the Amazon Personalize console, choose Dataset groups in the navigation pane.
  2. Choose your dataset group.
  3. Choose Create solutions.
  4. For Solution name, enter your solution name.
  5. For Solution type, select Item recommendation.
  6. For Recipe, choose the new aws-user-personalization-v2 recipe.
  7. In the Training configuration section, for Automatic training, select Turn on to maintain the effectiveness of your model by retraining it on a regular cadence.
  8. Under Hyperparameter configuration, select Apply recency bias. Recency bias determines whether the model should give more weight to the most recent item interactions data in your interactions dataset.
  9. Choose Create solution.

If you turned on automatic training, Amazon Personalize will automatically create your first solution version. A solution version refers to a trained ML model. When a solution version is created for the solution, Amazon Personalize trains the model backing the solution version based on the recipe and training configuration. It can take up to 1 hour for the solution version creation to start.

  1. Under Custom resources in the navigation pane, choose Campaigns.
  2. Choose Create campaign.

A campaign deploys a solution version (trained model) to generate real-time recommendations. Campaigns created with solutions trained on v2 recipes are automatically opted-in to include item metadata in recommendation results. You can choose metadata columns during an inference call.

  1. Provide your campaign details and create your campaign.

Get recommendations

After you create or update your campaign, you can get a recommended list of items that users are more likely to interact with, sorted from highest to lowest.

  1. Select the campaign and View details.
  2. In the Test campaign results section, enter the User ID and choose Get recommendations.

The following table shows a recommendation result for a user that includes the recommended items, relevance score, and item metadata (Title and Genre).

Your User-Personalization-v2 campaign is now ready to feed into your website or app and personalize the journey of each of your customers.
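For a programmatic equivalent of the console test, a minimal sketch using the Amazon Personalize runtime client might look like the following; the campaign ARN and user ID are placeholders.

import boto3

personalize_runtime = boto3.client("personalize-runtime")

response = personalize_runtime.get_recommendations(
    campaignArn="arn:aws:personalize:us-east-1:111122223333:campaign/movielens-v2-campaign",  # placeholder
    userId="123",   # placeholder user ID
    numResults=10,
)

# Each item includes an itemId and a relevance score; campaigns built on v2
# recipes can also return item metadata when you request it.
for item in response["itemList"]:
    print(item["itemId"], item.get("score"))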

Clean up

Make sure you clean up any unused resources you created in your account while following the steps outlined in this post. You can delete campaigns, datasets, and dataset groups via the Amazon Personalize console or using the Python SDK.

Conclusion

The new Amazon Personalize User-Personalization-v2 and Personalized-Ranking-v2 recipes take personalization to the next level with support of larger item catalogs, reduced latency, and optimized performance. For more information about Amazon Personalize, see the Amazon Personalize Developer Guide.


About the Authors

Jingwen Hu is a Senior Technical Product Manager working with AWS AI/ML on the Amazon Personalize team. In her spare time, she enjoys traveling and exploring local food.

Daniel Foley is a Senior Product Manager for Amazon Personalize. He is focused on building applications that leverage artificial intelligence to solve our customers’ largest challenges. Outside of work, Dan is an avid skier and hiker.

Pranesh Anubhav is a Senior Software Engineer for Amazon Personalize. He is passionate about designing machine learning systems to serve customers at scale. Outside of his work, he loves playing soccer and is an avid follower of Real Madrid.

Tianmin Liu is a Senior Software Engineer working on Amazon Personalize. He focuses on developing recommender systems at scale using various machine learning algorithms. In his spare time, he likes playing video games, watching sports, and playing the piano.

Abhishek Mangal is a software engineer working for Amazon Personalize. He works on developing recommender systems at scale using various machine learning algorithms. In his spare time, he likes to watch anime and believes One Piece is the greatest piece of storytelling in recent history.

Yifei Ma is a Senior Applied Scientist at AWS AI Labs working on recommender systems. His research interests lie in active learning, generative models, time series analysis, and online decision-making. Outside of work, he is an aviation enthusiast.

Hao Ding is a Senior Applied Scientist at AWS AI Labs and is working on advancing the recommender system for Amazon Personalize. His research interests lie in recommendation foundation models, Bayesian deep learning, large language models, and their applications in recommendation.

Rishabh Agrawal is a Senior Software Engineer working on AI services at AWS. In his spare time, he enjoys hiking, traveling and reading.

Read More

Get started with Amazon Titan Text Embeddings V2: A new state-of-the-art embeddings model on Amazon Bedrock

Get started with Amazon Titan Text Embeddings V2: A new state-of-the-art embeddings model on Amazon Bedrock

Embeddings are integral to various natural language processing (NLP) applications, and their quality is crucial for optimal performance. They are commonly used in knowledge bases to represent textual data as dense vectors, enabling efficient similarity search and retrieval. In Retrieval Augmented Generation (RAG), embeddings are used to retrieve relevant passages from a corpus to provide context for language models to generate informed, knowledge-grounded responses. Embeddings also play a key role in personalization and recommendation systems by representing user preferences, item characteristics, and historical interactions as vectors, allowing calculation of similarities for personalized recommendations based on user behavior and item embeddings. As new embedding models are released with incremental quality improvements, organizations must weigh the potential benefits against the associated costs of upgrading, considering factors like computational resources, data reprocessing, integration efforts, and projected performance gains impacting business metrics.

In September of 2023, we announced the launch of Amazon Titan Text Embeddings V1, a multilingual text embeddings model that converts text inputs like single words, phrases, or large documents into high-dimensional numerical vector representations. Since then, many of our customers have used the V1 model, which supports over 25 languages, accepts inputs of up to 8,192 tokens, and outputs vectors of 1,536 dimensions for high accuracy and low latency. The model was made available as a serverless offering via Amazon Bedrock, simplifying embedding generation and integration with downstream applications. We published a follow-up post on January 31, 2024, and provided code examples using AWS SDKs and LangChain, showcasing a Streamlit semantic search app.

Today, we are happy to announce Amazon Titan Text Embeddings V2, our second-generation embeddings model for Amazon Bedrock. The new model is optimized for the most common use cases we see with many of our active customers, including RAG, multi-language, and code embedding use cases. The following table summarizes the key differences compared to V1.

| Feature | Amazon Titan Text Embeddings V1 | Amazon Titan Text Embeddings V2 |
| Output dimension support | 1,536 | 256, 512, 1,024 |
| Language support | 25+ | 100+ |
| Unit vector normalization support | No | Yes |
| Price per million tokens | $0.10 | $0.02 per 1 million tokens, or $0.00002 per 1,000 tokens |

With these new features, we expect many more customers to choose Amazon Titan Text Embeddings V2 to build common generative artificial intelligence (AI) applications. In this post, we discuss the benefits of the V2 model, how to conduct your own evaluation of the model, and how to migrate to using the new model.

Let’s dig in!

Benefits of Amazon Titan Text Embeddings V2

Amazon Titan Text Embeddings V2 is the second-generation embedding model for Amazon Bedrock, optimized for some of the most common customer use cases we have seen with our customers. Some of the key features include:

  • Optimized for RAG solutions
  • Flexible embedding sizes
  • Improved multilingual support and code

Embeddings have become an integral part of various NLP applications, and their quality is crucial for achieving optimal performance.

The large language model (LLM) landscape is rapidly evolving, with leading providers offering increasingly powerful and versatile embedding models. Although incremental improvements in embedding quality may seem modest at the high level, the actual benefits can be significant for specific use cases. For example, in a recommendation system for a large ecommerce platform, a modest increase in recommendation accuracy could translate into significant additional revenue.

A common way to select an embedding model (or any model) is to look at public benchmarks; an accepted benchmark for measuring embedding quality is the MTEB leaderboard. The Massive Text Embedding Benchmark (MTEB) evaluates text embedding models across a wide range of tasks and datasets. MTEB encompasses 8 different embedding tasks, covering a total of 58 datasets and 112 languages. In this benchmark, 33 different text embedding models were evaluated on the MTEB tasks. A key finding from the benchmark was that no single text embedding method emerged as the clear leader across all tasks and datasets. Each model exhibited strengths and weaknesses depending on the specific embedding task and data characteristics. This highlights the need for continued research into developing more versatile and robust text embedding techniques that can perform well across diverse use cases and language domains.

Although this is a useful benchmark, we caution our enterprise customers with the following considerations:

  • Although the MTEB leaderboard is widely recognized, it provides only a partial assessment by focusing solely on accuracy metrics and overlooking crucial practical factors like inference latency and model capabilities. The leaderboard rankings combine and compare embedding models across different vector dimensions, making direct and fair model comparisons challenging.
  • Additionally, the leaders on this accuracy-centric leaderboard change frequently as new models are continually introduced, providing a shifting and incomplete perspective on practical model performance trade-offs that real-world applications must consider beyond just accuracy numbers.
  • Lastly, costs need to be weighed against the expected benefits and performance improvements in the specific use case. A small gain in accuracy may not justify the significant overhead and opportunity costs of transitioning embeddings models, especially in large-scale, business-critical applications. Enterprises should perform a rigorous cost-benefit analysis to make sure the projected performance uplift from an updated embeddings model provides sufficient return on investment (ROI) to offset the migration costs and operational disruption.

In summary, start with evaluating the benchmark scores, but don’t decide until you have done your own due diligence.

Benchmark results

The Amazon Titan Text Embeddings V2 model can output embeddings of various sizes. Using a smaller size reduces your memory footprint, which translates directly into cost savings. The default size is 1,024 dimensions, compared to the 1,536-dimension output of V1, implying a direct storage reduction of approximately 33%; because vector database storage is a major cost component of a RAG solution, this reduction translates into meaningful savings. In our internal testing, we found that using the 256-dimension output resulted in only about 3.24% accuracy loss while delivering roughly a four times saving due to the size reduction. Running our evaluation on MTEB datasets, we found Amazon Titan Text Embeddings V2 to perform competitively, with scores like 57.5 on reranking tasks, for example. With the model trained on over 100 languages, it’s no surprise the model achieves scores like 55 on the MIRACL multilingual dataset and has an overall weighted average MTEB score of 60.37. Full MTEB scores are available on the MTEB leaderboard.

However, we strongly encourage you to run your own benchmarks with your own dataset to understand the operational metrics. A sample notebook showing how to run the benchmarks against the MTEB datasets is hosted here. The key steps involved are:

  1. Choose a representative set of data to embed and keywords to search.
  2. Use the Amazon Titan Text Embeddings V2 model to embed your data and keywords, adjusting the chunk size and overlap as needed.
  3. Carry out a similarity search using your preferred vector comparison method (such as Euclidean distance or cosine similarity).

Use Amazon Titan Text Embeddings V2 on Amazon Bedrock

The new Amazon Titan Text Embeddings V2 model is available through the fully managed, serverless experience on Amazon Bedrock. You can use the model through either the Amazon Bedrock REST API or the AWS SDK. The required parameters are the text that you want to generate the embeddings for and the modelId parameter, which represents the name of the Amazon Titan Text Embeddings model. Furthermore, you can now specify the output size of the vector, which is a significant feature of the V2 model.

Throughput has been a key requirement for running large ingestion workloads, and the Amazon Titan Text Embeddings model supports batch inference on Amazon Bedrock to increase the throughput for your workloads. The following code is an example using the AWS SDK for Python (Boto3):

import boto3
import json
 
# Create the connection to Amazon Bedrock
bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2',
)

# Define prompt and model parameters
prompt_data = """Priority should be funding retirement through ROTH/IRA/401K over HSA extra.  You need to fund your HSA for reasonable and expected medical expenses. """
modelId = "amazon.titan-embed-text-v2:0"   
accept = "application/json"
contentType = "application/json"

sample_model_input={
    "inputText": prompt_data,
    "dimensions": 256,
    "normalize": True
}

body = json.dumps(sample_model_input)
# Invoke model
response = bedrock_runtime.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)

response_body = json.loads(response.get('body').read())
embedding = response_body.get("embedding")
# Print response and embedding
print(f"The embedding vector has {len(embedding)} valuesn{embedding[0:3]+['...']+embedding[-3:]}")

The full notebook is available in the GitHub repo.

With Amazon Titan Text Embeddings, you can input up to 8,192 tokens, allowing you to work with phrases or entire documents based on your use case. The model returns output vectors with dimensions ranging from 256 to 1,024 without sacrificing accuracy, while also optimizing for storage cost and low latency. Typically, models with larger content windows are tuned for accuracy at the expense of latency, because they’re usually used in asynchronous workloads. However, even with its larger content window, Amazon Titan Text Embeddings achieves low latency, and with batching, it provides higher throughput for your workloads.

Run your own benchmarking

We always encourage our customers to perform their own benchmarking using their documents or the standard MTEB datasets and evaluation. For a sample of how to use the MTEB, see the GitHub repo. This notebook shows you how to load the dataset and set up evaluation for your specific use case (task) and run the benchmarking. If you run the benchmarking with your dataset, the typical steps involved are:

  1. Use the Amazon Titan Text Embeddings V2 model to embed your data and keywords, adjusting the chunk size and overlap as needed.
  2. Run similarity searches using your preferred distance metrics based on your choice of vector database.

A sample notebook showing how to use an in-memory database is available in the GitHub repo. This is a sample setup and should not be used for your production workloads where you would be connecting to robust vector database offerings like Amazon OpenSearch Serverless.
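As a minimal illustration of the similarity-search step, the following sketch (with an illustrative helper function and example strings) embeds two texts with Amazon Titan Text Embeddings V2 and compares them with cosine similarity; a production workload would store the vectors in a vector database instead.

import json
import boto3
import numpy as np

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

def embed(text, dimensions=256):
    # Request a normalized embedding of the requested dimension from Titan V2
    body = json.dumps({"inputText": text, "dimensions": dimensions, "normalize": True})
    response = bedrock_runtime.invoke_model(
        body=body,
        modelId="amazon.titan-embed-text-v2:0",
        accept="application/json",
        contentType="application/json",
    )
    return np.array(json.loads(response["body"].read())["embedding"])

query = embed("How should I prioritize HSA versus 401K contributions?")
doc = embed("Priority should be funding retirement through ROTH/IRA/401K over HSA extra.")

# Because the vectors are normalized, the dot product equals cosine similarity
print(float(np.dot(query, doc)))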

Migrate to Amazon Titan Text Embeddings V2

The cost and performance advantages provided by the V2 model are compelling reasons to consider reindexing your existing vector embeddings using V2. Let’s explore a few examples to illustrate the potential benefits, focusing solely on embedding costs.

Use case 1: High volume of searches

This first use case pertains to customers with a high volume of searches. The details are as follows:

  • Scenario:
    • 1 million documents, 100 million chunks, 1,000 average tokens per chunk
    • 100,000 searches per day, 1,000 token size for search
  • One-time cost:
    • Number of tokens: 100,000 million
    • Price per million tokens: $0.02
    • Reindexing cost: 100,000 * $0.02 = $2,000
  • Ongoing monthly savings (compared to V1):
    • Tokens embedded per month: 30 * 100,000 * 1,000 = 3,000 million
    • Savings per month (when migrating from V1 to V2): 3,000 * ($0.1 – $0.02) = $240

For this use case, the one-time reindexing cost of $2,000 will likely break even within 8–9 months through the ongoing monthly savings.
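As a quick sanity check on this arithmetic, the following snippet recomputes the one-time cost, monthly savings, and break-even point from the illustrative figures above.

# Illustrative figures from use case 1
chunks = 100_000_000            # 100 million chunks
tokens_per_chunk = 1_000
searches_per_day = 100_000
tokens_per_search = 1_000

v1_price = 0.10 / 1_000_000     # $ per token
v2_price = 0.02 / 1_000_000

reindex_cost = chunks * tokens_per_chunk * v2_price          # one-time V2 reindexing cost
monthly_tokens = 30 * searches_per_day * tokens_per_search   # search tokens per month
monthly_savings = monthly_tokens * (v1_price - v2_price)

print(reindex_cost, monthly_savings, reindex_cost / monthly_savings)
# ~2000.0, ~240.0, ~8.3 months to break even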

Use case 2: Ongoing indexing

This use case is for customers with ongoing indexing. The details are as follows:

  • Scenario:
    • 500,000 documents, 50 million chunks, average 1,000 tokens per chunk
    • 10,000 (2%) new documents added per month
    • 1,000 searches per day, 1,000 token size for search
  • One-time cost:
    • Number of tokens: 50,000 million
    • Price per million tokens: $0.02
    • Reindexing cost: 50,000 * $0.02 = $1,000
  • Ongoing monthly savings (compared to V1):
    • Tokens embedded per month for storage: 10,000 new documents * 100 chunks per document * 1,000 tokens per chunk = 1,000 million
    • Tokens embedded per month for search: 30 * 1,000 * 1,000 = 30 million
    • Savings per month (vs. V1): 1,030 * ($0.1 – $0.02) = $82.4

For this use case, the one-time reindexing cost of $1,000 nets an estimated monthly savings of $82.40, so it would break even in roughly 12 months.

These calculations do not account for the additional savings due to the reduced storage size (up to four times) with V2. This could translate into further cost savings in terms of your vector database storage requirements. The extent of these savings will vary depending on your specific data storage needs.

Conclusion

In this post, we introduced the new Amazon Titan Text Embeddings V2 model, with superior performance across various use cases like retrieval, reranking, and multilingual tasks. You can potentially realize substantial cost savings and performance improvements by reindexing your vector embeddings using the V2 model. The specific benefits will vary based on factors such as the volume of data, search traffic, and storage requirements, but the examples discussed in this post illustrate the potential value proposition. Amazon Titan Text Embeddings V2 is available today in the us-east-1 and us-west-2 AWS Regions.


About the authors

Shreyas Subramanian is a Principal AI/ML specialist Solutions Architect, and helps customers by using Machine Learning to solve their business challenges using the AWS platform. Shreyas has a background in large scale optimization and Machine Learning, and in use of Machine Learning and Reinforcement Learning for accelerating optimization tasks.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Pradeep Sridharan is a Senior Solutions Architect at AWS. He has years of experience in digital business transformation—designing and implementing solutions to drive market competitiveness and revenue growth across multiple sectors. He  specializes in AI/ML, Data Analytics and Application Modernization and Migration. Pradeep is based in Arizona (US).

Anuradha Durfee is a Senior Product Manager at AWS working on generative AI. She has spent the last five years working on natural language understanding and is motivated by enabling life-like conversations between humans and technology. Anuradha is based in Boston, MA.

Read More

Simple guide to training Llama 2 with AWS Trainium on Amazon SageMaker

Simple guide to training Llama 2 with AWS Trainium on Amazon SageMaker

Large language models (LLMs) are making a significant impact in the realm of artificial intelligence (AI). Their impressive generative abilities have led to widespread adoption across various sectors and use cases, including content generation, sentiment analysis, chatbot development, and virtual assistant technology. Llama2 by Meta is an example of an LLM offered by AWS. Llama 2 is an auto-regressive language model that uses an optimized transformer architecture and is intended for commercial and research use in English. It comes in a range of parameter sizes—7 billion, 13 billion, and 70 billion—as well as pre-trained and fine-tuned variations. To learn more about Llama 2 on AWS, refer to Llama 2 foundation models from Meta are now available in Amazon SageMaker JumpStart.

Many practitioners fine-tune or pre-train these Llama 2 models with their own text data to improve accuracy for their specific use case. However, in some cases, a challenge arises for practitioners: the high cost of fine-tuning and training. As organizations strive to push the boundaries of what LLMs can achieve, the demand for cost-effective training solutions has never been more pressing. In this post, we explore how you can use the Neuron distributed training library to fine-tune, continuously pre-train, and reduce the cost of training LLMs such as Llama 2 with AWS Trainium instances on Amazon SageMaker.

AWS Trainium instances for training workloads

SageMaker ml.trn1 and ml.trn1n instances, powered by Trainium accelerators, are purpose-built for high-performance deep learning training and offer up to 50% cost-to-train savings over comparable training optimized Amazon Elastic Compute Cloud (Amazon EC2) instances. This post implements a solution with the ml.trn1.32xlarge Trainium instance type, typically used for training large-scale models. However, there are also comparable ml.trn1n instances that offer twice as much networking throughput (1,600 Gbps) via Amazon Elastic Fabric Adapter (EFAv2). SageMaker Training supports the availability of ml.trn1 and ml.trn1n instances in the US East (N. Virginia) and US West (Oregon) AWS Regions, and most recently announced general availability in the US East (Ohio) Region. These instances are available in the listed Regions with On-Demand, Reserved, and Spot Instances, or additionally as part of a Savings Plan.

For more information on Trainium Accelerator chips, refer to Achieve high performance with lowest cost for generative AI inference using AWS Inferentia2 and AWS Trainium on Amazon SageMaker. Additionally, check out AWS Trainium Customers to learn more about customer testimonials, or see Amazon EC2 Trn1 Instances for High-Performance Model Training are Now Available to dive into the accelerator highlights and specifications.

Using the Neuron Distributed library with SageMaker

SageMaker is a fully managed service that provides developers, data scientists, and practitioners the ability to build, train, and deploy machine learning (ML) models at scale. SageMaker Training includes features that improve and simplify the ML training experience, including managed infrastructure and images for deep learning, automatic model tuning with hyperparameter optimization, and a pay-for-what-you-use billing structure. This section highlights the advantages of using SageMaker for distributed training with the Neuron Distributed library, which is part of the AWS Neuron SDK used to run deep learning workloads on AWS Inferentia and AWS Trainium based instances. In particular, it covers the managed infrastructure and the time-to-train and cost-to-train benefits of the associated resiliency and recovery features.

In high performance computing (HPC) clusters, such as those used for deep learning model training, hardware resiliency issues can be a potential obstacle. Although hardware failures while training on a single instance may be rare, issues resulting in stalled training become more prevalent as a cluster grows to tens or hundreds of instances. Regular checkpointing helps mitigate wasted compute time, but engineering teams managing their own infrastructure must still closely monitor their workloads and be prepared to remediate a failure at all hours to minimize training downtime. The managed infrastructure of SageMaker Training includes several resiliency features that make this monitoring and recovery process streamlined:

  • Cluster health checks – Before a training job starts, SageMaker runs health checks and verifies communication on the provisioned instances. It then replaces any faulty instances, if necessary, to make sure the training script starts running on a healthy cluster of instances. Health checks are currently enabled for the TRN1 instance family as well as P* and G* GPU-based instance types.
  • Automatic checkpointing – Checkpoints from a local path (/opt/ml/checkpoints by default) are automatically copied to an Amazon Simple Storage Service (Amazon S3) location specified by the user. When training is restarted, SageMaker automatically copies the previously saved checkpoints from the S3 location back to the local checkpoint directory to make sure the training script can load and resume the last saved checkpoint.
  • Monitoring and tracking training – In the case of a node failure, it’s important to have the visibility of where the failure occurs. Using PyTorch Neuron gives data scientists the ability to track training progress in a TensorBoard. This allows you to capture the loss of the training job to determine when the training job should be stopped to identify the convergence of the model for optimal training.
  • Built-in retries and cluster repair – You can configure SageMaker to automatically retry training jobs that fail with a SageMaker internal server error (ISE). As part of retrying a job, SageMaker replaces any instances that encountered unrecoverable errors with fresh instances, reboots all healthy instances, and starts the job again. This results in faster restarts and workload completion. Cluster update is currently enabled for the TRN1 instance family as well as P and G GPU-based instance types. Practitioners can add their own applicative retry mechanism around the client code that submits the job, to handle other types of launch errors, such as exceeding your account quota.

For customers working with large clusters of hundreds of instances for a training job, the resiliency and recovery features of SageMaker Training can reduce total time for a model to converge by up to 20% via fewer failures and faster recovery. This also enables engineering teams to monitor and react to failures at all hours. Although SageMaker training jobs are suitable for general-purpose training use cases with customizable configurations and integration with the broader AWS ecosystem, Amazon SageMaker HyperPod is specifically optimized for efficient and resilient training of foundation models at scale. For more information on SageMaker HyperPod use cases, refer to the SageMaker HyperPod developer guide.
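As an illustration of how these resiliency features are enabled on a training job, the following minimal sketch configures automatic checkpointing to Amazon S3 and internal-error retries on a SageMaker estimator; the script name, source directory, role, bucket, and container image are placeholders rather than the exact configuration used later in this post.

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="run_llama_pretrain.py",            # hypothetical Neuron training script
    source_dir="scripts",                           # hypothetical source directory
    role="<your-sagemaker-execution-role>",
    instance_type="ml.trn1.32xlarge",
    instance_count=8,
    image_uri="<neuronx-training-container-uri>",   # AWS Deep Learning Container for Trainium
    # Checkpoints written to /opt/ml/checkpoints are copied to this S3 prefix and
    # restored automatically when a job is retried.
    checkpoint_s3_uri="s3://<your-bucket>/llama2-checkpoints/",
    # Automatically retry the job on SageMaker internal server errors.
    max_retry_attempts=3,
)

estimator.fit({"train": "s3://<your-bucket>/training-data/"})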

In this post, we use the Neuron Distributed library to continuously pre-train a Llama 2 model using tensor and pipeline parallelism using SageMaker training jobs. To learn more about the resiliency and recovery features of SageMaker Training, refer to Training large language models on Amazon SageMaker: Best practices.

Solution overview

In this solution, we use an ml.t3.medium instance type on a SageMaker Jupyter notebook to run the provided cells, and we continuously pre-train the llama2-70b model using trn1.32xlarge Trainium instances. First, let’s familiarize ourselves with the techniques used to distribute the training job for continual pre-training of llama2-70b with the Neuron Distributed training library.

The techniques used to convert the pre-trained weights in the convert_pretrained_weights.ipynb notebook into a .pt (PyTorch) weights file are called pipeline parallelism and tensor parallelism:

  • Pipeline parallelism splits the deep neural network into stages of layers across devices and splits each batch into microbatches, allowing each stage worker to process one microbatch at a time and keeping all stages busy.
  • Tensor parallelism splits individual tensors of a neural network across multiple devices. This technique allows you to train models with large tensors that can’t fit into the memory of a single device.

After we convert our pre-trained weights with the preceding techniques in our first notebook, we follow two separate notebooks in the same sagemaker-trainium-examples folder. The second notebook is Training_llama2_70b.ipynb, which walks through the continuous pre-training process by saving our checkpoint of converted model weights in the first notebook and prepping it for inference. When this step is complete, we can run the Convert_Nxd_to_hf.ipynb notebook, which takes our pre-trained weights using the NeuronX library and converts it into a readable format in Hugging Face to serve inference.

Prerequisites

You need to complete some prerequisites before you can run the first notebook.

First, make sure you have created a Hugging Face access token so you can download the Hugging Face tokenizer used later. After you have the access token, you need to make a few quota increase requests for SageMaker. You need to request between 8 and 32 Trn1 instances, depending on the time-to-train and cost-to-train trade-offs for your use case.

On the Service Quotas console, request the following SageMaker quotas:

  • Trainium instances (ml.trn1.32xlarge) for training job usage: 8–32
  • ml.trn1.32xlarge for training warm pool usage: 8–32
  • Maximum number of instances per training job: 8–32

It may take up to 24 hours for the quota increase to get approved. However, after submitting the quota increase requests, you can go to the sagemaker-trainium-examples GitHub repo and locate the convert_pretrained_weights.ipynb file. This is the file you use to begin the continuous pre-training process.

Now that you’re ready to begin the process to continuously pre-train the llama2-70b model, you can convert the pre-trained weights in the next section to prep the model and create the checkpoint.

Getting started

Complete the following steps:

  1. Install all the required packages and libraries: SageMaker, Boto3, transformers, and datasets.

These packages make sure that you can set up your environment to access your pre-trained Llama 2 model, download your tokenizer, and get your pre-training dataset.

!pip install -U sagemaker boto3 --quiet
!pip install transformers datasets[s3] --quiet
  2. After the packages are installed, retrieve your Hugging Face access token, and download and define your tokenizer.

The tokenizer meta-llama/Llama-2-70b-hf is a specialized tokenizer that breaks down text into smaller units for natural language processing. This tokenized data will later be uploaded into Amazon S3 to allow for running your training job.

from huggingface_hub.hf_api import HfFolder

# Update the access token to download the tokenizer
access_token = "hf_insert-key-here"
HfFolder.save_token(access_token)

from transformers import AutoTokenizer
tokenizer_name = "meta-llama/Llama-2-70b-hf"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
block_size = 4096
  3. Download the wikicorpus dataset from the Hugging Face Hub.
  4. Tokenize the dataset with the Llama 2 tokenizer that you just initialized.

By tokenizing the data, you prepare the model for pre-training: exposing Llama 2 to the trilingual (Catalan, English, Spanish) text in the wikicorpus dataset lets it learn the patterns and relationships in that data.
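
The following is a minimal sketch of these two steps, assuming the raw English configuration of wikicorpus and a simple packing function; the exact dataset configuration and preprocessing cells in the notebook may differ.

from itertools import chain
from datasets import load_dataset

# Download a wikicorpus configuration (the notebook's exact configuration may differ)
dataset = load_dataset("wikicorpus", "raw_en", split="train")

# Tokenize the raw text with the Llama 2 tokenizer defined earlier
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"]),
    batched=True,
    remove_columns=dataset.column_names,
)

# Pack token IDs into fixed-length blocks of block_size (4096) tokens
def group_texts(examples):
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [v[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, v in concatenated.items()
    }

train_dataset = tokenized.map(group_texts, batched=True)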

After the data is tokenized, run the following cell to store the training dataset in Amazon S3:

# save training dataset to s3
training_input_path = f's3://{sess.default_bucket()}/neuronx_distributed/data'
print(f"uploading training dataset to: {training_input_path}")
train_dataset.save_to_disk(training_input_path)

print(f"uploaded data to: {training_input_path}")

The preceding cell defines training_input_path and uploads the tokenized dataset to your S3 bucket. You’re now ready to begin the training job.

Run the training job

For the training job, we use trn1.32xlarge instances, each with 32 Neuron cores. We use tensor parallelism and pipeline parallelism, which allow you to shard the model across Neuron cores for training.

The following code is the configuration for pretraining llama2-70b with trn1:

# Number of processes per node
PROCESSES_PER_NODE = 32
# Number of instances within the cluster, change this if you want to tweak the instance_count parameter
WORLD_SIZE = 32
# Global batch size
GBS = 512
# Input sequence length
SEQ_LEN = 4096
# Pipeline parallel degree
PP_DEGREE = 8
# Tensor parallel degree
TP_DEGREE = 8
# Data parallel size
DP = (PROCESSES_PER_NODE * WORLD_SIZE) / (TP_DEGREE * PP_DEGREE)
# Batch size per model replica
BS = GBS / DP
# Number of microbatches for pipeline execution. Set to the same value as BS so each microbatch contains a single data sample
NUM_MICROBATCHES = BS
# Number of total steps for which to train the model. Adjust this to the step number at which the loss approaches convergence.
MAX_STEPS = 1500
# Timeout in seconds for training. After this amount of time, Amazon SageMaker terminates the job regardless of its current status.
MAX_RUN = 2 * (24 * 60 * 60)
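
With these settings, the data parallel degree works out to DP = (32 × 32) / (8 × 8) = 16 model replicas, so each replica processes BS = 512 / 16 = 32 samples per step, run as 32 microbatches of one sample each through the 8 pipeline stages.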

Now you can define the hyperparameters for training. Note that adjusting these parameters based on hardware capabilities, dataset characteristics, and convergence requirements can significantly impact training performance and efficiency.

The following is the code for the hyperparameters:

hyperparameters = {}
hyperparameters["train_batch_size"] = int(BS)
hyperparameters["use_meta_device_init"] = 1
hyperparameters["training_dir"] = "/opt/ml/input/data/train" # path where sagemaker uploads the training data
hyperparameters["training_config"] = "config.json" # config file containing llama 70b configuration , change this for tweaking the number of parameters.

hyperparameters["max_steps"] = MAX_STEPS
hyperparameters["seq_len"] = SEQ_LEN
hyperparameters["pipeline_parallel_size"] = PP_DEGREE
hyperparameters["tensor_parallel_size"] = TP_DEGREE
hyperparameters["num_microbatches"] = int(NUM_MICROBATCHES)
hyperparameters["lr"] = 0.00015
hyperparameters["min_lr"] = 1e-05
hyperparameters["beta1"] = 0.9
hyperparameters["beta2"] = 0.95
hyperparameters["weight_decay"] = 0.1
hyperparameters["warmup_steps"] = 2000
hyperparameters["constant_steps"] = 0
hyperparameters["use_zero1_optimizer"] = 1
hyperparameters["tb_dir"] = "/opt/ml/checkpoints/tensorboard" # The tensorboard logs will be stored here and eventually pushed to S3.

Now you specify the Docker image that will be used to train the model on Trainium:

docker_image = f"763104351884.dkr.ecr.{region_name}.amazonaws.com/pytorch-training-neuronx:1.13.1-neuronx-py310-sdk2.18.0-ubuntu20.04"

The image we defined is designed for PyTorch training with Neuron optimizations. It uses Neuron SDK version 2.18.0 for enhanced performance and efficiency on Trn1 instances equipped with AWS Trainium chips, is compatible with Python 3.10 (indicated by py310), and is based on Ubuntu 20.04.

Prior to starting your training job, you need to configure it by defining all necessary variables. You do so by defining the training job name, checkpoint directory, and cache directory:

import time
# Define the training job name
job_name = f'llama-neuron-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
# Define the checkpoint directory that contains the weights and other relevant data for the trained model
checkpoint_s3_uri = "s3://" + sagemaker_session_bucket + "/neuron_llama_experiment"
checkpoint_dir = '/opt/ml/checkpoints'
# Define the Neuron cache directory
cache_dir = "/opt/ml/checkpoints/neuron_cache"

The parameters enable you to do the following:

  • The training job name allows you to identify and track individual training jobs based on timestamps
  • The checkpoint directory specifies the S3 URI where the checkpoint data, weights, and other information are stored for the trained model
  • The cache directory helps optimize the training process by storing and reusing previously calculated values, from the checkpoint directory, reducing redundancy and improving efficiency
  • The environment variables make sure that the training job is optimally configured, with settings tailored for efficient training using features like RDMA, optimized memory allocation, fused operations, and Neuron-specific device optimizations (a representative sketch of such an env dictionary follows this list)
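
The env dictionary itself is defined in the notebook; the following is a representative sketch only, assuming commonly used EFA and Neuron settings, and the notebook’s actual values may differ.

env = {
    "FI_PROVIDER": "efa",               # use Elastic Fabric Adapter for inter-node communication
    "FI_EFA_USE_DEVICE_RDMA": "1",      # enable RDMA over EFA
    "FI_EFA_FORK_SAFE": "1",
    "MALLOC_ARENA_MAX": "128",          # tune glibc memory allocation for many threads
    "XLA_USE_BF16": "1",                # train in bfloat16 on Trainium
    "NEURON_FUSE_SOFTMAX": "1",         # enable fused softmax in the Neuron compiler
    "NEURON_CC_FLAGS": "--model-type transformer",  # hint the compiler about the workload
}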

After you have defined your training job and configured all directories and environment variables for an optimal training pipeline, you now set up your PyTorch estimator to begin the training job on SageMaker. A SageMaker estimator is a high-level interface that handles the end-to-end SageMaker training and deployment tasks.

The entry_point is specified as the Python script run_llama_nxd.py. We use the ml.trn1.32xlarge instance type, set the instance count to WORLD_SIZE (32, defined earlier in the configuration code), and set input_mode to FastFile. Fast File mode in SageMaker streams data from Amazon S3 on demand, which optimizes data loading performance by fetching data as needed and reduces overall resource consumption. For more information on input modes, refer to Access Training Data.

from sagemaker.pytorch import PyTorch

# Handle end-to-end Amazon SageMaker training and deployment tasks.
pt_estimator = PyTorch(
    entry_point='run_llama_nxd.py',
    source_dir='./scripts',
    instance_type="ml.trn1.32xlarge",
    image_uri=docker_image,
    instance_count=WORLD_SIZE,
    max_run=MAX_RUN,
    hyperparameters=hyperparameters,
    role=role,
    base_job_name=job_name,
    environment=env,
    input_mode="FastFile",
    disable_output_compression=True,
    keep_alive_period_in_seconds=600, # added to enable the warm pool capability
    checkpoint_s3_uri=checkpoint_s3_uri,
    checkpoint_local_path=checkpoint_dir,
    distribution={"torch_distributed": {"enabled": True}} # enable torchrun
)

Finally, you can start the training job with the SageMaker fit() method, which trains the model based on the defined hyperparameters:

# Start training job
pt_estimator.fit({"train": training_input_path})

You have successfully started the process of continuously pre-training a llama2-70b model by converting the pre-trained weights, together with the tokenized data, using SageMaker training on Trainium instances.
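
While the job runs, you can track its progress on the SageMaker console or programmatically. The following is a minimal sketch using boto3; the job name placeholder is yours to fill in with the name printed when fit() starts.

import boto3

sm = boto3.client("sagemaker")
# Replace with the training job name printed when fit() starts (derived from base_job_name)
training_job_name = "<your-training-job-name>"
desc = sm.describe_training_job(TrainingJobName=training_job_name)
print(desc["TrainingJobStatus"], desc["SecondaryStatus"])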

Continuous pre-training

After following the prerequisites, completing the first notebook, and converting the pre-trained weights into a checkpoint, you can now begin the continuous pre-training process, using that checkpoint as the starting point for pre-training the llama2-70b model.

To begin the continuous pre-training process, follow the Training_llama2_70b.ipynb file in the sagemaker-trainium-examples repo.

Given the large size of the llama2-70b model, you need to convert the pre-trained weights into a more efficient and usable format (.pt). You can do so by defining hyperparameters in your configuration that store the converted weights and checkpoints. The following are the hyperparameters:

# Use the SageMaker S3 checkpoints mechanism since we need read/write access to the paths.
hyperparameters["output_dir"] = "/opt/ml/checkpoints/llama70b_weights"
hyperparameters["checkpoint-dir"] = '/opt/ml/checkpoints'
hyperparameters["n_layers"] = 80
hyperparameters["convert_from_full_model"] = ""

In these hyperparameters, output_dir is the location that the pre-training step later reads the converted weights from. If you are at this cell, you should have already followed the Training_llama2_70b.ipynb notebook, set up your SageMaker client and Docker image, and prepared the pre-trained weights. You’re now ready to perform continuous pre-training on the llama2-70b model.

We use the following parameters so that the pre-trained weights stored in output_dir by the convert_pretrained_weights.ipynb notebook are reused for continuous pre-training:

hyperparameters["checkpoint_dir"] = "/opt/ml/checkpoints/checkpts"
hyperparameters["checkpoint_freq"] = 10
hyperparameters["num_kept_checkpoint"] = 1
hyperparameters["use_zero1_optimizer"] = 1
hyperparameters["save_load_xser"] = 0
hyperparameters["pretrained_weight_dir"] = "/opt/ml/checkpoints/llama70b_weights"

After these hyperparameters are implemented, you can run the rest of the notebook cells to complete the continuous pre-training process. After the SageMaker estimator has completed the training job, you can locate the new checkpoint in the S3 checkpoint directory containing the weights. You can now locate the convert_Nxd_to_hf.ipynb file to get the checkpoint ready for inferencing.
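
To confirm that the new checkpoint was written, you can list the objects under your checkpoint prefix. The following is a minimal sketch using boto3; the bucket and prefix placeholders correspond to the checkpoint_s3_uri you configured for the training job.

import boto3

s3 = boto3.client("s3")
# Placeholders: use the bucket and prefix from your checkpoint_s3_uri
resp = s3.list_objects_v2(Bucket="<your-bucket>", Prefix="<your-checkpoint-prefix>")
for obj in resp.get("Contents", []):
    print(obj["Key"])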

Convert the Neuron Distributed checkpoint for inferencing

Checkpoints play a vital role in distributed training with the NeuronX library because the library provides checkpoint compatibility with Hugging Face Transformers. You get the training job output ready for inferencing by taking the output, which is saved as a NeuronX distributed checkpoint, and converting the weights into .pt weights files.

To convert the checkpoints to Hugging Face format using NeuronX, you first need to save the S3 nxd_checkpoint_path directory:

# S3 checkpoint directory that contains the weights and other relevant data from the continuously pre-trained model
checkpoint_s3_uri = "<pre-training-checkpoint-s3-uri>"
nxd_checkpoint_path = f"s3://{checkpoint_s3_uri}/neuronx_llama_experiment/checkpts/step10/model/"
# The checkpoint is saved as part of Notebook 2

After you save the checkpoint in the nxd_checkpoint_path directory, you can define your hyperparameters and configure your SageMaker estimator for the conversion job. You can then run the estimator’s fit() method to convert the pre-trained weights into a checkpoint format ready for inferencing with the following cell:

# Start SageMaker job
estimator.fit({"checkpoint": nxd_checkpoint_path})

Summary

You have successfully performed continuous pre-training on a llama2-70b model and converted the resulting checkpoint so it can be used to serve inference, using the Neuron SDK and Trainium instances. By following the solution in this post, you now know how to configure a pipeline for continuous pre-training of an LLM using SageMaker and Trainium accelerator chips.

For more information on how to use Trainium for your workloads, refer to the Neuron SDK documentation or reach out directly to the team. We value customer feedback and are always looking to engage with ML practitioners and builders. Feel free to leave comments or questions in the comments section.


About the authors

Marco Punio is a Solutions Architect focused on generative AI strategy, applied AI solutions and conducting research to help customers hyperscale on AWS. He is a qualified technologist with a passion for machine learning, artificial intelligence, and mergers & acquisitions. Marco is based in Seattle, WA and enjoys writing, reading, exercising, and building applications in his free time.

Armando Diaz is a Solutions Architect at AWS. He focuses on generative AI, AI/ML, and data analytics. At AWS, Armando helps customers integrate cutting-edge generative AI capabilities into their systems, fostering innovation and competitive advantage. When he’s not at work, he enjoys spending time with his wife and family, hiking, and traveling the world.

Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, train, and migrate ML production workloads to SageMaker at scale. He specializes in deep learning, especially in the area of NLP and CV. Outside of work, he enjoys running and hiking.

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads frameworks, compilers, and optimization techniques for deep learning training.

Niithiyn Vijeaswaran is a Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.

Rohit Talluri is a Generative AI GTM Specialist (Tech BD) at Amazon Web Services (AWS). He is partnering with top generative AI model builders, strategic customers, key AI/ML partners, and AWS Service Teams to enable the next generation of artificial intelligence, machine learning, and accelerated computing on AWS. He was previously an Enterprise Solutions Architect, and the Global Solutions Lead for AWS Mergers & Acquisitions Advisory.

Sebastian Bustillo is a Solutions Architect at AWS. He focuses on AI/ML technologies with a profound passion for generative AI and compute accelerators. At AWS, he helps customers unlock business value through generative AI. When he’s not at work, he enjoys brewing a perfect cup of specialty coffee and exploring the world with his wife.
