How Getir reduced model training durations by 90% with Amazon SageMaker and AWS Batch

This is a guest post co-authored by Nafi Ahmet Turgut, Hasan Burak Yel, and Damla Şentürk from Getir.

Established in 2015, Getir has positioned itself as the trailblazer in the sphere of ultrafast grocery delivery. This innovative tech company has revolutionized the last-mile delivery segment with its compelling offering of “groceries in minutes.” With a presence across Turkey, the UK, the Netherlands, Germany, and the United States, Getir has become a multinational force to be reckoned with. Today, the Getir brand represents a diversified conglomerate encompassing nine different verticals, all working synergistically under a singular umbrella.

In this post, we explain how we built an end-to-end product category prediction pipeline to help commercial teams by using Amazon SageMaker and AWS Batch, reducing model training duration by 90%.

Understanding our existing product assortment in a detailed manner is a crucial challenge that we, along with many businesses, face in today’s fast-paced and competitive market. An effective solution to this problem is the prediction of product categories. A model that generates a comprehensive category tree allows our commercial teams to benchmark our existing product portfolio against that of our competitors, offering a strategic advantage. Therefore, our central challenge is the creation and implementation of an accurate product category prediction model.

We capitalized on the powerful tools provided by AWS to tackle this challenge and effectively navigate the complex field of machine learning (ML) and predictive analytics. Our efforts led to the successful creation of an end-to-end product category prediction pipeline, which combines the strengths of SageMaker and AWS Batch.

This capability of predictive analytics, particularly the accurate forecast of product categories, has proven invaluable. It provided our teams with critical data-driven insights that optimized inventory management, enhanced customer interactions, and strengthened our market presence.

The methodology we explain in this post ranges from the initial phase of feature set gathering to the final implementation of the prediction pipeline. An important aspect of our strategy has been the use of SageMaker and AWS Batch to refine pre-trained BERT models for seven different languages. Additionally, our seamless integration with AWS’s object storage service Amazon Simple Storage Service (Amazon S3) has been key to efficiently storing and accessing these refined models.

SageMaker is a fully managed ML service. With SageMaker, data scientists and developers can quickly and effortlessly build and train ML models, and then directly deploy them into a production-ready hosted environment.

As a fully managed service, AWS Batch helps you run batch computing workloads of any scale. AWS Batch automatically provisions compute resources and optimizes the workload distribution based on the quantity and scale of the workloads. With AWS Batch, there’s no need to install or manage batch computing software, so you can focus your time on analyzing results and solving problems. We used GPU jobs, which run containers on instances equipped with GPUs.
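To make the GPU job concept concrete, the following is a minimal sketch, written with the AWS SDK for Python (Boto3), of how a training job might be submitted to a GPU-enabled job queue. It is an illustration rather than Getir’s actual code: the queue, job definition, and environment variable names are hypothetical, and the job definition is assumed to reference a container image that runs the fine-tuning script.

import boto3

batch = boto3.client("batch", region_name="eu-west-1")  # assumed Region

# Submit a single-node training job; the resourceRequirements entry is what
# makes AWS Batch place the job on a GPU instance and expose the GPU to the container.
response = batch.submit_job(
    jobName="bert-finetune-tr",              # hypothetical job name
    jobQueue="gpu-training-queue",           # hypothetical GPU-backed job queue
    jobDefinition="bert-finetune-job-def",   # hypothetical job definition
    containerOverrides={
        "resourceRequirements": [{"type": "GPU", "value": "1"}],
        "environment": [{"name": "LANGUAGE_CODE", "value": "tr"}],  # hypothetical per-country parameter
    },
)
print(response["jobId"])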

Overview of solution

Five people from Getir’s data science team and infrastructure team worked together on this project. The project was completed in a month and deployed to production after a week of testing.

The following diagram shows the solution’s architecture.

The model pipeline is run separately for each country. The architecture includes two AWS Batch GPU cron jobs for each country, running on defined schedules.
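The post doesn’t describe how these schedules are wired up. One common pattern, shown below as a hedged sketch rather than Getir’s actual setup, is an Amazon EventBridge cron rule whose target is the AWS Batch job queue, so the job is submitted automatically on a fixed schedule; the rule name, ARNs, and job names are hypothetical.

import boto3

events = boto3.client("events", region_name="eu-west-1")  # assumed Region

# Run the country-specific pipeline every day at 03:00 UTC.
events.put_rule(
    Name="category-pipeline-tr",               # hypothetical rule name
    ScheduleExpression="cron(0 3 * * ? *)",
    State="ENABLED",
)

# Point the rule at an AWS Batch job queue; EventBridge submits the job on schedule.
events.put_targets(
    Rule="category-pipeline-tr",
    Targets=[
        {
            "Id": "batch-target",
            "Arn": "arn:aws:batch:eu-west-1:123456789012:job-queue/gpu-training-queue",  # hypothetical
            "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-batch-role",          # hypothetical
            "BatchParameters": {
                "JobDefinition": "bert-finetune-job-def",  # hypothetical
                "JobName": "scheduled-bert-finetune-tr",
            },
        }
    ],
)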

We overcame some challenges by strategically deploying SageMaker and AWS Batch GPU resources. The process used to address each difficulty is detailed in the following sections.

Fine-tuning multilingual BERT models with AWS Batch GPU jobs

We sought a solution to support multiple languages for our diverse user base. BERT models were an obvious choice due to their established ability to handle complex natural language tasks effectively. In order to tailor these models to our needs, we used AWS Batch single-node GPU instance jobs. This allowed us to fine-tune pre-trained BERT models for each of the seven languages we needed to support. Through this method, we ensured high precision in predicting product categories, overcoming potential language barriers.
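The post doesn’t include the training code itself. As a rough, hedged illustration, a single-language fine-tuning run with a public multilingual BERT checkpoint and the Hugging Face Transformers library could look like the following; the data file, column names, number of categories, and hyperparameters are assumptions, not Getir’s values.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-multilingual-cased"  # public multilingual BERT checkpoint
NUM_CATEGORIES = 50                          # hypothetical number of category labels

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_CATEGORIES)

# Assumes a CSV with a "product_name" text column and an integer "label" column (0..NUM_CATEGORIES-1).
dataset = load_dataset("csv", data_files={"train": "products_tr.csv"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["product_name"], truncation=True,
                            padding="max_length", max_length=64),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="/opt/ml/output",
                           num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("/opt/ml/output/model")  # the artifact is then packaged and uploaded to Amazon S3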

Efficient model storage using Amazon S3

Our next step was to address model storage and management. For this, we selected Amazon S3, known for its scalability and security. Storing our fine-tuned BERT models on Amazon S3 enabled us to provide easy access to different teams within our organization, thereby significantly streamlining our deployment process. This was a crucial aspect in achieving agility in our operations and a seamless integration of our ML efforts.
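A minimal sketch of this storage step with Boto3 follows; the bucket name and key layout (one prefix per language, artifacts packaged as tar.gz files) are assumptions for illustration.

import boto3

s3 = boto3.client("s3")
BUCKET = "category-models-bucket"  # hypothetical bucket
KEY = "bert/tr/model.tar.gz"       # hypothetical key, one prefix per language

# The training job uploads the fine-tuned artifact...
s3.upload_file("/opt/ml/output/model.tar.gz", BUCKET, KEY)

# ...and any downstream consumer (for example, the deployment step) pulls the same object.
s3.download_file(BUCKET, KEY, "/tmp/model.tar.gz")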

Creating an end-to-end prediction pipeline

An efficient pipeline was required to make the best use of our pre-trained models. We first deployed these models on SageMaker, an action that allowed for real-time predictions with low latency, thereby enhancing our user experience. For larger-scale batch predictions, which were equally vital to our operations, we utilized AWS Batch GPU jobs. This ensured the optimal use of our resources, providing us with a perfect balance of performance and efficiency.
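As a hedged sketch of the real-time half of this pipeline, the following deploys a fine-tuned Hugging Face model artifact from Amazon S3 to a SageMaker real-time endpoint with the SageMaker Python SDK. The artifact path, endpoint name, instance type, and framework versions are assumptions, not Getir’s actual configuration.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # assumes a SageMaker execution role is available

model = HuggingFaceModel(
    model_data="s3://category-models-bucket/bert/tr/model.tar.gz",  # hypothetical artifact
    role=role,
    transformers_version="4.26",  # assumed framework versions with a matching inference container
    pytorch_version="1.13",
    py_version="py39",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",       # assumed GPU inference instance
    endpoint_name="product-category-tr",  # hypothetical endpoint name
)

print(predictor.predict({"inputs": "organik süt 1L"}))  # example product name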

Exploring future possibilities with SageMaker MMEs

As we continue to evolve and seek efficiencies in our ML pipeline, one avenue we are keen to explore is using SageMaker multi-model endpoints (MMEs) for deploying our fine-tuned models. With MMEs, we can potentially streamline the deployment of various fine-tuned models, ensuring efficient model management while also benefiting from the native capabilities of SageMaker like shadow variants, auto scaling, and Amazon CloudWatch integration. This exploration aligns with our continuous pursuit of enhancing our predictive analytics capabilities and providing superior experiences to our customers.
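A sketch of what that exploration could look like with the SageMaker Python SDK’s MultiDataModel, which serves many artifacts from a single endpoint and loads them on demand by name, is shown below; the model name, container image, S3 prefix, and instance type are placeholders, not a tested configuration.

import sagemaker
from sagemaker.multidatamodel import MultiDataModel
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer

session = sagemaker.Session()
role = sagemaker.get_execution_role()

mme = MultiDataModel(
    name="product-category-mme",                             # hypothetical model name
    model_data_prefix="s3://category-models-bucket/bert/",   # prefix holding per-language model.tar.gz files
    image_uri="<mme-capable-inference-image-uri>",           # placeholder container image
    role=role,
    sagemaker_session=session,
)
mme.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge",
           endpoint_name="product-category-mme")

# Route each request to a specific artifact stored under the prefix.
predictor = Predictor(endpoint_name="product-category-mme",
                      sagemaker_session=session,
                      serializer=JSONSerializer())
predictor.predict({"inputs": "organik süt 1L"}, target_model="tr/model.tar.gz")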

Conclusion

Our successful integration of SageMaker and AWS Batch has not only addressed our specific challenges but also significantly boosted our operational efficiency. Through the implementation of a sophisticated product category prediction pipeline, we are able to empower our commercial teams with data-driven insights, thereby facilitating more effective decision-making.

Our results speak volumes about our approach’s effectiveness. We have achieved an 80% prediction accuracy across all four levels of category granularity, which plays an important role in shaping the product assortments for each country we serve. This level of precision extends our reach beyond language barriers and ensures we cater to our diverse user base with the utmost accuracy.

Moreover, by strategically using scheduled AWS Batch GPU jobs, we’ve been able to reduce our model training durations by 90%. This efficiency has further streamlined our processes and bolstered our operational agility. Efficient model storage in Amazon S3 has also played a critical role in this achievement, supporting both real-time and batch predictions.

For more information about how to get started building your own ML pipelines with SageMaker, see Amazon SageMaker resources. AWS Batch is an excellent option if you are looking for a low-cost, scalable solution for running batch jobs with low operational overhead. To get started, see Getting Started with AWS Batch.


About the Authors

Nafi Ahmet Turgut finished his master’s degree in Electrical & Electronics Engineering and worked as a graduate research scientist. His focus was building machine learning algorithms to simulate nervous network anomalies. He joined Getir in 2019 and currently works as a Senior Data Science & Analytics Manager. His team is responsible for designing, implementing, and maintaining end-to-end machine learning algorithms and data-driven solutions for Getir.

Hasan Burak Yel received his bachelor’s degree in Electrical & Electronics Engineering at Boğaziçi University. He worked at Turkcell, mainly focused on time series forecasting, data visualization, and network automation. He joined Getir in 2021 and currently works as a Data Science & Analytics Manager with the responsibility of Search, Recommendation, and Growth domains.

Damla Şentürk received her bachelor’s degree in Computer Engineering from Galatasaray University. She is continuing her master’s degree in Computer Engineering at Boğaziçi University. She joined Getir in 2022, and has been working as a Data Scientist. She has worked on commercial, supply chain, and discovery-related projects.

Esra Kayabalı is a Senior Solutions Architect at AWS, specialized in the analytics domain, including data warehousing, data lakes, big data analytics, batch and real-time data streaming, and data integration. She has 12 years of software development and architecture experience. She is passionate about learning and teaching cloud technologies.


Boosting developer productivity: How Deloitte uses Amazon SageMaker Canvas for no-code/low-code machine learning

The ability to quickly build and deploy machine learning (ML) models is becoming increasingly important in today’s data-driven world. However, building ML models requires significant time, effort, and specialized expertise. From data collection and cleaning to feature engineering, model building, tuning, and deployment, ML projects often take months for developers to complete. And experienced data scientists can be hard to come by.

This is where the AWS suite of low-code and no-code ML services becomes an essential tool. With just a few clicks using Amazon SageMaker Canvas, you can take advantage of the power of ML without needing to write any code.

As a strategic systems integrator with deep ML experience, Deloitte utilizes the no-code and low-code ML tools from AWS to efficiently build and deploy ML models for Deloitte’s clients and for internal assets. These tools allow Deloitte to develop ML solutions without needing to hand-code models and pipelines. This can help speed up project delivery timelines and enable Deloitte to take on more client work.

The following are some specific reasons why Deloitte uses these tools:

  • Accessibility for non-programmers – No-code tools open up ML model building to non-programmers. Team members with domain expertise and little to no coding experience can develop ML models.
  • Rapid adoption of new technology – The availability and constant improvement of ready-to-use models and AutoML help ensure that users are always working with leading-edge technology.
  • Cost-effective development – No-code tools help reduce the cost and time required for ML model development, making it more accessible to clients, which can help them achieve a higher return on investment.

Additionally, these tools provide a comprehensive solution for faster workflows, enabling the following:

  • Faster data preparation – SageMaker Canvas offers over 300 built-in transformations and natural language-based data preparation, which accelerate getting data ready for model building.
  • Faster model building – SageMaker Canvas offers ready-to-use models and AutoML technology that enable you to build custom models on enterprise data with just a few clicks. This helps speed up the process compared to coding models from the ground up.
  • Easier deployment – SageMaker Canvas can deploy production-ready models to an Amazon SageMaker endpoint in a few clicks while also registering them in Amazon SageMaker Model Registry.

Vishveshwara Vasa, Cloud CTO for Deloitte, says:

“Through AWS’s no-code ML services such as SageMaker Canvas and SageMaker Data Wrangler, we at Deloitte Consulting have unlocked new efficiencies, enhancing the speed of development and deployment productivity by 30–40% across our client-facing and internal projects.”

In this post, we demonstrate the power of building an end-to-end ML model with no code using SageMaker Canvas by showing you how to build a classification model for predicting if a customer will default on a loan. By predicting loan defaults more accurately, the model can help a financial services company manage risk, price loans appropriately, improve operations, provide additional services, and gain a competitive advantage. We demonstrate how SageMaker Canvas can help you rapidly go from raw data to a deployed binary classification model for loan default prediction.

SageMaker Canvas offers comprehensive data preparation capabilities powered by Amazon SageMaker Data Wrangler in the SageMaker Canvas workspace. This enables you to go through all the phases of a standard ML workflow, from data preparation to model building and deployment, on a single platform.

Data preparation is typically the most time-intensive phase of the ML workflow. To reduce time spent on data preparation, SageMaker Canvas allows you to prepare your data using over 300 built-in transformations. Alternatively, you can write natural language prompts, such as “drop the rows for column c that are outliers,” and be presented with the code snippet necessary for this data preparation step. You can then add this to your data preparation workflow in a few clicks. We show you how to use that in this post as well.
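To make that concrete, the snippet such a prompt could produce might look roughly like the following pandas code (an illustration of the idea, not the exact code SageMaker Canvas generates); the column name c and the 1.5*IQR rule are assumptions.

import pandas as pd

def drop_outlier_rows(df: pd.DataFrame, column: str = "c") -> pd.DataFrame:
    # Keep only rows whose value in `column` falls inside the 1.5*IQR fences.
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]

# Example usage on a toy frame: the row with 100 is dropped as an outlier.
df = pd.DataFrame({"c": [1, 2, 3, 4, 100]})
print(drop_outlier_rows(df))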

Solution overview

The following diagram describes the architecture for a loan default classification model using SageMaker low-code and no-code tools.

Starting with a dataset that has details about loan default data in Amazon Simple Storage Service (Amazon S3), we use SageMaker Canvas to gain insights about the data. We then perform feature engineering to apply transformations such as encoding categorical features, dropping features that are not needed, and more. Next, we store the cleansed data back in Amazon S3. We use the cleaned dataset to create a classification model for predicting loan defaults. Then we have a production-ready model for inference.

Prerequisites

Make sure that the following prerequisites are complete and that you have enabled the Canvas Ready-to-use models option when setting up the SageMaker domain. If you have already set up your domain, edit your domain settings and go to Canvas settings to enable the Enable Canvas Ready-to-use models option. Additionally, set up and create the SageMaker Canvas application, then request and enable Anthropic Claude model access on Amazon Bedrock.

Dataset

We use a public dataset from Kaggle that contains information about financial loans. Each row in the dataset represents a single loan, and the columns provide details about each transaction. Download this dataset and store it in an S3 bucket of your choice. The following table lists the fields in the dataset.

Column Name Data Type Description
Person_age Integer Age of the person who took a loan
Person_income Integer Income of the borrower
Person_home_ownership String Home ownership status (own or rent)
Person_emp_length Decimal Number of years they are employed
Loan_intent String Reason for loan (personal, medical, educational, and so on)
Loan_grade String Loan grade (A–E)
Loan_int_rate Decimal Interest rate
Loan_amnt Integer Total amount of the loan
Loan_status Integer Target (whether they defaulted or not)
Loan_percent_income Decimal Loan amount compared to the percentage of the income
Cb_person_default_on_file Integer Previous defaults (if any)
Cb_person_credit_history_length String Length of their credit history

Simplify data preparation with SageMaker Canvas

Data preparation can take up to 80% of the effort in ML projects. Proper data preparation leads to better model performance and more accurate predictions. SageMaker Canvas allows interactive data exploration, transformation, and preparation without writing any SQL or Python code.

Complete the following steps to prepare your data:

  1. On the SageMaker Canvas console, choose Data preparation in the navigation pane.
  2. On the Create menu, choose Document.
  3. For Dataset name, enter a name for your dataset.
  4. Choose Create.
  5. Choose Amazon S3 as the data source and connect it to the dataset.
  6. After the dataset is loaded, create a data flow using that dataset.
  7. Switch to the analyses tab and create a Data Quality and Insights Report.

This is a recommended step to analyze the quality of the input dataset. The output of this report produces instant ML-powered insights such as data skew, duplicates in the data, missing values, and much more. The following screenshot shows a sample of the generated report for the loan dataset.

By generating these insights on your behalf, SageMaker Canvas provides you with a set of issues in the data that need remediation in the data preparation phase. To address the top two issues identified by SageMaker Canvas, you need to encode the categorical features and remove the duplicate rows so your model quality is high. You can do both of these and more in a visual workflow with SageMaker Canvas.

  1. First, one-hot encode the loan_intent, loan_grade, and person_home_ownership columns.
  2. You can drop the cb_person_cred_history_length column because that column has the least predicting power, as shown in the Data Quality and Insights Report.

    SageMaker Canvas recently added a Chat with data option. This feature uses the power of foundation models to interpret natural language queries and generate Python-based code to apply feature engineering transformations. This feature is powered by Amazon Bedrock, and can be configured to run entirely in your VPC so that data never leaves your environment.
  3. To use this feature to remove duplicate rows, choose the plus sign next to the Drop column transform, then choose Chat with data.
  4. Enter your query in natural language (for example, “Remove duplicate rows from the dataset”).
  5. Review the generated transformation and choose Add to steps to add the transformation to the flow.
  6. Finally, export the output of these transformations to Amazon S3 or optionally Amazon SageMaker Feature Store to use these features across multiple projects.

You can also add another step to create an Amazon S3 destination for the dataset to scale the workflow for a large dataset. The following diagram shows the SageMaker Canvas data flow after adding visual transformations.

You have completed the entire data processing and feature engineering step using visual workflows in SageMaker Canvas. This helps reduce the time a data engineer spends on cleaning and making the data ready for model development from weeks to days. The next step is to build the ML model.

Build a model with SageMaker Canvas

Amazon SageMaker Canvas provides a no-code end-to-end workflow for building, analyzing, testing, and deploying this binary classification model. Complete the following steps:

  1. Create a dataset in SageMaker Canvas.
  2. Specify either the S3 location that was used to export the data or the S3 location that was set as the destination of the SageMaker Canvas job.

    Now you’re ready to build the model.
  3. Choose Models in the navigation pane and choose New model.
  4. Name the model and select Predictive analysis as the model type.
  5. Choose the dataset created in the previous step.

    The next step is configuring the model type.
  6. Choose the target column; the model type is automatically set to 2 category prediction.
  7. Choose your build type, Standard build or Quick build.

    SageMaker Canvas displays the expected build time as soon as you start building the model. Standard build usually takes 2–4 hours; you can use the Quick build option for smaller datasets, which only takes 2–15 minutes. For this particular dataset, it should take around 45 minutes to complete the model build. SageMaker Canvas keeps you informed of the progress of the build process.
  8. After the model is built, you can look at the model performance.

    SageMaker Canvas provides various metrics like accuracy, precision, and F1 score depending on the type of the model. The following screenshot shows the accuracy and a few other advanced metrics for this binary classification model.
  9. The next step is to make test predictions.
    SageMaker Canvas allows you to make batch predictions on multiple inputs or a single prediction to quickly verify the model quality. The following screenshot shows a sample inference.
  10. The last step is to deploy the trained model.
    SageMaker Canvas deploys the model on SageMaker endpoints, and now you have a production model ready for inference. The following screenshot shows the deployed endpoint.

After the model is deployed, you can call it through the AWS SDK or AWS Command Line Interface (AWS CLI) or make API calls to any application of your choice to confidently predict the risk of a potential borrower. For more information about testing your model, refer to Invoke real-time endpoints.
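As a hedged example of programmatic invocation with Boto3 (not Canvas-generated code), a call could look like the following; the endpoint name is hypothetical, and the CSV feature values and their order must match the schema of the dataset you exported from SageMaker Canvas.

import boto3

runtime = boto3.client("sagemaker-runtime")

# Illustrative single record; replace with values in the same order as your training features.
payload = "22,59000,RENT,1.0,PERSONAL,D,16.02,35000,0.59,Y,3"

response = runtime.invoke_endpoint(
    EndpointName="canvas-loan-default-endpoint",  # hypothetical endpoint name
    ContentType="text/csv",
    Body=payload,
)
print(response["Body"].read().decode("utf-8"))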

Clean up

To avoid incurring additional charges, log out of SageMaker Canvas or delete the SageMaker domain that was created. Additionally, delete the SageMaker model endpoint and delete the dataset that was uploaded to Amazon S3.

Conclusion

No-code ML accelerates development, simplifies deployment, doesn’t require programming skills, increases standardization, and reduces cost. These benefits made no-code ML attractive to Deloitte for improving its ML service offerings, and it has shortened Deloitte’s ML model build timelines by 30–40%.

Deloitte is a strategic global systems integrator with over 17,000 certified AWS practitioners across the globe. It continues to raise the bar through participation in the AWS Competency Program with 25 competencies, including Machine Learning. Connect with Deloitte to start bringing AWS no-code and low-code solutions to your enterprise.


About the authors

Chida Sadayappan leads Deloitte’s Cloud AI/Machine Learning practice. He brings strong thought leadership experience to engagements and thrives in supporting executive stakeholders achieve performance improvement and modernization goals across industries using AI/ML. Chida is a serial tech entrepreneur and an avid community builder in the startup and developer ecosystems.

Kuldeep Singh, a Principal Global AI/ML leader at AWS with over 20 years in tech, skillfully combines his sales and entrepreneurship expertise with a deep understanding of AI, ML, and cybersecurity. He excels in forging strategic global partnerships, driving transformative solutions and strategies across various industries with a focus on generative AI and GSIs.

Kasi Muthu is a senior partner solutions architect focusing on data and AI/ML at AWS based out of Houston, TX. He is passionate about helping partners and customers accelerate their cloud data journey. He is a trusted advisor in this field and has plenty of experience architecting and building scalable, resilient, and performant workloads in the cloud. Outside of work, he enjoys spending time with his family.


Experience the new and improved Amazon SageMaker Studio

Launched in 2019, Amazon SageMaker Studio provides one place for all end-to-end machine learning (ML) workflows, from data preparation, building, and experimentation to training, hosting, and monitoring. As we continue to innovate to increase data science productivity, we’re excited to announce the improved SageMaker Studio experience, which allows users to select the managed Integrated Development Environment (IDE) of their choice, while having access to the SageMaker Studio resources and tooling across the IDEs. This updated user experience (UX) provides data scientists, data engineers, and ML engineers more choice on where to build and train their ML models within SageMaker Studio. As a web application, SageMaker Studio now offers improved load times, faster IDE and kernel startup times, and automatic upgrades.

In addition to managed JupyterLab and RStudio on Amazon SageMaker, we have also launched managed Visual Studio Code open-source (Code-OSS) with SageMaker Studio. Once a user selects Code Editor and launches the Code Editor space backed by the compute and storage of their choice, they can take advantage of the SageMaker tooling and Amazon Toolkit, as well as integration with Amazon EMR, Amazon CodeWhisperer, GitHub, and the ability to customize the environment with custom images. As they can do today with JupyterLab and RStudio on SageMaker, users can switch the Code Editor compute on the fly based on their needs.

Lastly, in order to streamline the data science process and avoid users having to jump from the console to Amazon SageMaker Studio, we added the ability to view Training Jobs and Endpoint details in the SageMaker Studio user interface (UI) and have enabled the ability to view all running instances across launched applications. Additionally, we improved our JumpStart foundation models (FMs) experience so users can quickly discover, import, register, fine-tune, and deploy an FM.

Solution overview

Launch IDEs

With the new version of Amazon SageMaker Studio, the JupyterLab server is updated to provide faster startup times and a more reliable experience. SageMaker Studio is now a multi-tenant web application from where users can not only launch JupyterLab, but also have the option to launch Visual Studio Code open-source (Code-OSS), RStudio, and Canvas as managed applications. The SageMaker Studio UI enables you to access and discover SageMaker resources and ML tooling such as Jobs, Endpoints, and Pipelines in a consistent manner, regardless of your IDE of choice.
Amazon SageMaker Studio applications
Launch IDEs
SageMaker Studio provides a default private space that only you can access, where you can run JupyterLab or Code Editor.
Create JupyterLab private space
Create Code Editor private space
You also have the option to create a new space in SageMaker Studio Classic, which will be shared with all the users in your domain.
Create Studio Classic space

Enhanced ML Workflow

With the new interactive experience, there are significant enhancements and simplifications to parts of the existing ML workflow in Amazon SageMaker. Specifically, Training and Hosting now offer a much more intuitive UI-driven experience for creating new jobs and endpoints, along with metric tracking and monitoring interfaces.

Training

For training models on Amazon SageMaker, users can conduct training in a variety of ways, whether through a Studio notebook via a notebook job, a dedicated training job, or a fine-tuning job via SageMaker JumpStart. With the enhanced UI experience, you can track past and current training jobs using the Studio Training panel.
View Training jobs
You can also switch between specific training jobs to review performance, model artifact locations, and configurations such as the hardware and hyperparameters behind a training job. The UI also gives you the flexibility to start and stop training jobs from the console.
Training job details

Hosting

Amazon SageMaker also offers a variety of hosting options that you can use for model deployment within the UI. To create a SageMaker endpoint, go to the Models section, where you can use existing models or create a new one.
View models
Here you can use either a single model to deploy an Amazon SageMaker real-time endpoint, or multiple models to work with the advanced SageMaker hosting options.
Create an endpoint
For FMs, you can optionally use the Amazon SageMaker JumpStart panel to browse the list of available FMs and either fine-tune or deploy them through the UI.
Amazon SageMaker Jumpstart panel

Setup

The updated Amazon SageMaker Studio experience is launching alongside the Amazon SageMaker Studio Classic experience. You can try out the new UI and choose to opt in to make the updated experience the default option for new and existing domains. The documentation lists the steps to migrate from SageMaker Studio Classic.

Conclusion

In this post, we showed you the features available in the new and improved Amazon SageMaker Studio. With the updated SageMaker Studio experience, users now have the ability to select their preferred IDE backed by the compute of their choice and start the kernel within seconds, with access to SageMaker tooling and resources through the SageMaker Studio web application. The addition of Training and Endpoint details within SageMaker Studio, as well as the improved Amazon SageMaker JumpStart UX, provides a seamless integration of ML steps within the SageMaker Studio UX. Get started on SageMaker Studio here.


About the Authors

Mair Hasco is an AI/ML Specialist for Amazon SageMaker Studio. She helps customers optimize their machine learning workloads using Amazon SageMaker.

Ram Vegiraju is a ML Architect with the SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.

Lauren Mullennex is a Senior AI/ML Specialist Solutions Architect at AWS. She has a decade of experience in DevOps, infrastructure, and ML. She is also the author of a book on computer vision. In her spare time, she enjoys traveling and hiking.

Khushboo Srivastava is a Senior Product Manager for Amazon SageMaker. She enjoys building products that simplify machine learning workflows for customers, and loves playing with her 1-year old daughter.


Amazon SageMaker simplifies setting up SageMaker domain for enterprises to onboard their users to SageMaker

As organizations scale the adoption of machine learning (ML), they are looking for efficient and reliable ways to deploy new infrastructure and onboard teams to ML environments. One of the challenges is setting up authentication and fine-grained permissions for users based on their roles and activities. For example, MLOps engineers typically perform model deployment activities, whereas data scientists perform ML training and validation activities. Another challenge is the effort required to set up and manage the networking configurations. Typically, there is no simple mechanism for administrators to discover, implement, and manage the right networking and security configurations their teams need.

That’s why today we are excited to announce the new onboarding experience that makes it effortless for you to set up Amazon SageMaker domains for your organization. As a platform administrator, you can use the updated user interface (UI) and APIs to onboard users faster, with the right security settings and infrastructure.

Let’s see what’s new and how to get started!

Introducing the SageMaker domain setup UI for organizations

The new UI for organizations lets you set up a SageMaker domain via the AWS Console and onboard users and organizations with just a few clicks. The redesigned UI guides you through the setup and provides step-by-step instructions so that you can scale quickly. You can choose between AWS Identity and Access Management (IAM) or AWS IAM Identity Center authentication and map scoped-down policies to your existing groups or users. You can assign existing roles or create new ones based on their typical ML activities. An ML activity represents a set of permissions for a specific task, such as running ML training jobs.

In addition to setting up and configuring your SageMaker apps and execution roles, the new experience offers an updated UI for implementing complex networking configuration, such as VPC endpoints, subnets and security groups, and encryption settings. You can also manage your subnets and connection modes later on if changes are required.

Now let’s go through the new experience in more depth.

Prerequisites

Before you use the advanced setup for organizations, you need to have the following:

  • An AWS account
  • An IAM role with permissions to create the resources needed to set up a SageMaker domain

Set up a SageMaker domain for organizations

To experience the updated UI, the ML admin completes the following steps:

  1. On the SageMaker console, choose Set up for organizations.

    This takes you to the Set up SageMaker Domain wizard, where the Set up for organizations option is already selected.
  2. Choose Configure.
  3. On the Domain details page, enter a domain name, then choose Next.
  4. On the Users and ML Activities page, select your preferred authentication method. For this post, we select AWS IAM Identity Center. Note that your IAM Identity Center setup must be in the same Region as where you are creating your SageMaker domain.
  5. In the Who will use Studio? section, you can optionally choose user groups to grant access to the SageMaker domain.
  6. Select Create a new role to create a role to assign activities to, or use an existing role. For ML activities, select from the list of predefined activities.
  7. In the S3 Bucket Access section, enter an Amazon Simple Storage Service (Amazon S3) bucket that all the domain users will have access to, then choose Next. You can specify more than one S3 bucket.
  8. On the Applications page, you can specify and configure the integrated development environments (IDEs) available under the SageMaker domain. For SageMaker Studio, select the updated or classic version. You can also configure Canvas, Code Editor, and RStudio.
  9. Choose Next.
  10. On the Network page, select to use VPC only or public internet access. For this post, we select Virtual Private Cloud (VPC) Only. If you’re using a VPC, specify your VPC, subnets, and security groups, then choose Next.
  11. On the Storage page, you can optionally set an encryption key.
  12. You can also optionally configure the default and maximum space size for the Amazon Elastic Block Store (Amazon EBS) volume for the Amazon Elastic Compute Cloud (Amazon EC2) instance that hosts the JupyterLab and Code Editor.
  13. Choose Next.
  14. On the Review and create page, review your configurations, then choose Submit to create the domain.

  15. This starts the process of setting up the SageMaker domain, which takes 2–4 minutes to complete.
  16. When the domain is ready, a success banner appears.

New: Update existing domains for organizations

Now that we have gone through the user journey of an admin setting up a new SageMaker domain for organizations, the domain is ready and ML users can be onboarded to SageMaker. This process is not a one-time event; after creating the domains, the requirements may evolve and updates to the domain configuration are needed. Let’s explore some newly launched features as part of this setup that allow updates to existing domains.

Prerequisites to update domains

To use these new features, the ML admins must have access to:

Update a subnet in an existing domain via the AWS CLI

As organizations scale the adoption of ML, their needs evolve, which requires changes in their infrastructure. As you add more users and resources to your projects and teams, you require more resources (such as IP ranges and endpoints). You may also want to isolate a few subnets, disassociate these subnets from SageMaker Studio, and therefore remove the subnets from your domains. One of the challenges admins face when they want to add or remove subnets is that updating the subnets of a domain requires expertise and time. We’re excited to announce that we have simplified this process, and ML admins can now update the subnets of a domain via the AWS CLI.

Let’s walk through this functionality.

In this example use case, you have created a new SageMaker Studio domain with two subnets: subnet-1 and subnet-2. You have exhausted all the domain subnet IPs and now want to add new subnets subnet-3 and subnet-4 to the domain. See the following code:

# Update Domain with a new Subnet being added
aws --region $REGION --endpoint-url $SAGEMAKER_ENDPOINT sagemaker update-domain --domain-id $DOMAIN_ID --subnet-ids '["subnet-1","subnet-2","subnet-3", "subnet-4"]'
# Describe the Domain to see if the Domain Subnet list got updated
aws --region $REGION --endpoint-url $SAGEMAKER_ENDPOINT sagemaker describe-domain --domain-id $DOMAIN_ID

If you realize that you don’t actually need so many IPs, you can remove a subnet (for this example, subnet-4) from the existing list of subnets. See the following code:

# Update Domain with a Subnet being removed
aws --region $REGION --endpoint-url $SAGEMAKER_ENDPOINT sagemaker update-domain --domain-id $DOMAIN_ID --subnet-ids '["subnet-1","subnet-2","subnet-3"]'
# Describe the Domain to see if the Domain Subnet list got updated
aws --region $REGION --endpoint-url $SAGEMAKER_ENDPOINT sagemaker describe-domain --domain-id $DOMAIN_ID

Change your network connection mode in an existing domain via the AWS CLI

When you’re conducting tests or exploring SageMaker to learn more about the service, you might create your domain with public internet access. However, as you set up projects and scale your ML workloads, you may need to change your network connection mode to VPC only to comply with your organization’s existing network and security requirements. We’re excited to announce that ML admins can now change their network connection mode from public internet to VPC only mode via the AWS CLI.

For example, in the following code, we update the domain AppNetworkAccessType to VpcOnly:

# Update Domain App Network Access type
aws --region $REGION --endpoint-url $SAGEMAKER_ENDPOINT sagemaker update-domain --domain-id $DOMAIN_ID --app-network-access-type VpcOnly

In the following code, we update the domain AppNetworkAccessType to PublicInternetOnly:

# Update Domain App Network Access type
aws --region $REGION --endpoint-url $SAGEMAKER_ENDPOINT sagemaker update-domain --domain-id $DOMAIN_ID --app-network-access-type PublicInternetOnly

Conclusion

The new UI for organizations to set up domains and the new features related to updating existing domains are available today at no additional charge in all AWS Regions where SageMaker is available, except for the AWS GovCloud and AWS China Regions.

Try out these new features and let us know what you think. We always look forward to your feedback! You can send it through your usual AWS Support contacts or post it on the AWS Forum for SageMaker.

To learn more, visit New onboarding experience in SageMaker and check Onboard to Amazon SageMaker Domain using IAM Identity Center.


About the authors

Ozan Eken is a Senior Product Manager at Amazon Web Services. He is passionate about building onboarding products with the right infrastructure, security guardrails and governance for SageMaker. Outside of work, he likes exploring different outdoor activities and watching soccer.

Vikesh Pandey is a Machine Learning Specialist Solutions Architect at AWS, helping customers from financial industries design and build solutions on generative AI and ML. Outside of work, Vikesh enjoys trying out different cuisines and playing outdoor sports.

Anastasia Tzeveleka is a Machine Learning and AI Specialist Solutions Architect at AWS. She works with customers in EMEA and helps them architect machine learning solutions at scale using AWS services. She has worked on projects in different domains including Natural Language Processing (NLP), MLOps and Low Code No Code tools.


Welcome to a New Era of Building in the Cloud with Generative AI on AWS

We believe generative AI has the potential over time to transform virtually every customer experience we know. The number of companies launching generative AI applications on AWS is substantial and building quickly, including adidas, Booking.com, Bridgewater Associates, Clariant, Cox Automotive, GoDaddy, and LexisNexis Legal & Professional, to name just a few. Innovative startups like Perplexity AI are going all in on AWS for generative AI. Leading AI companies like Anthropic have selected AWS as their primary cloud provider for mission-critical workloads, and the place to train their future models. And global services and solutions providers like Accenture are reaping the benefits of customized generative AI applications as they empower their in-house developers with Amazon CodeWhisperer.

These customers are choosing AWS because we are focused on doing what we’ve always done—taking complex and expensive technology that can transform customer experiences and businesses and democratizing it for customers of all sizes and technical abilities. To do this, we’re investing and rapidly innovating to provide the most comprehensive set of capabilities across the three layers of the generative AI stack. The bottom layer is the infrastructure to train Large Language Models (LLMs) and other Foundation Models (FMs) and produce inferences or predictions. The middle layer is easy access to all of the models and tools customers need to build and scale generative AI applications with the same security, access control, and other features customers expect from an AWS service. And at the top layer, we’ve been investing in game-changing applications in key areas like generative AI-based coding. In addition to offering them choice and—as they expect from us—breadth and depth of capabilities across all layers, customers also tell us they appreciate our data-first approach, and trust that we’ve built everything from the ground up with enterprise-grade security and privacy.

This week we took a big step forward, announcing many significant new capabilities across all three layers of the stack to make it easy and practical for our customers to use generative AI pervasively in their businesses.

Bottom layer of the stack: AWS Trainium2 is the latest addition to deliver the most advanced cloud infrastructure for generative AI

The bottom layer of the stack is the infrastructure—compute, networking, frameworks, services—required to train and run LLMs and other FMs. AWS innovates to offer the most advanced infrastructure for ML. Through our long-standing collaboration with NVIDIA, AWS was the first to bring GPUs to the cloud more than 12 years ago, and most recently we were the first major cloud provider to make NVIDIA H100 GPUs available with our P5 instances. We continue to invest in unique innovations that make AWS the best cloud to run GPUs, including the price-performance benefits of the most advanced virtualization system (AWS Nitro), powerful petabit-scale networking with Elastic Fabric Adapter (EFA), and hyper-scale clustering with Amazon EC2 UltraClusters (thousands of accelerated instances co-located in an Availability Zone and interconnected in a non-blocking network that can deliver up to 3,200 Gbps for massive-scale ML training). We are also making it easier for any customer to access highly sought-after GPU compute capacity for generative AI with Amazon EC2 Capacity Blocks for ML—the first and only consumption model in the industry that lets customers reserve GPUs for future use (up to 500 deployed in EC2 UltraClusters) for short duration ML workloads.

Several years ago, we realized that to keep pushing the envelope on price performance we would need to innovate all the way down to the silicon, and we began investing in our own chips. For ML specifically, we started with AWS Inferentia, our purpose-built inference chip. Today, we are on our second generation of AWS Inferentia with Amazon EC2 Inf2 instances that are optimized specifically for large-scale generative AI applications with models containing hundreds of billions of parameters. Inf2 instances offer the lowest cost for inference in the cloud while also delivering up to four times higher throughput and up to ten times lower latency compared to Inf1 instances. Powered by up to 12 Inferentia2 chips, Inf2 are the only inference-optimized EC2 instances that have high-speed connectivity between accelerators so customers can run inference faster and more efficiently (at lower cost) without sacrificing performance or latency by distributing ultra-large models across multiple accelerators. Customers like Adobe, Deutsche Telekom, and Leonardo.ai have seen great early results and are excited to deploy their models at scale on Inf2.

On the training side, Trn1 instances—powered by AWS’s purpose-built ML training chip, AWS Trainium—are optimized to distribute training across multiple servers connected with EFA networking. Customers like Ricoh have trained a Japanese LLM with billions of parameters in mere days. Databricks is getting up to 40% better price-performance with Trainium-based instances to train large-scale deep learning models. But with new, more capable models coming out practically every week, we are continuing to push the boundaries on performance and scale, and we are excited to announce AWS Trainium2, designed to deliver even better price performance for training models with hundreds of billions to trillions of parameters. Trainium2 should deliver up to four times faster training performance than first-generation Trainium, and when used in EC2 UltraClusters should deliver up to 65 exaflops of aggregate compute. This means customers will be able to train a 300 billion parameter LLM in weeks versus months. Trainium2’s performance, scale, and energy efficiency are some of the reasons why Anthropic has chosen to train its models on AWS, and will use Trainium2 for its future models. And we are collaborating with Anthropic on continued innovation with both Trainium and Inferentia. We expect our first Trainium2 instances to be available to customers in 2024.

We’ve also been doubling down on the software tool chain for our ML silicon, specifically in advancing AWS Neuron, the software development kit (SDK) that helps customers get the maximum performance from Trainium and Inferentia. Since introducing Neuron in 2019 we’ve made substantial investments in compiler and framework technologies, and today Neuron supports many of the most popular publicly available models, including Llama 2 from Meta, MPT from Databricks, and Stable Diffusion from Stability AI, as well as 93 of the top 100 models on the popular model repository Hugging Face. Neuron plugs into popular ML frameworks like PyTorch and TensorFlow, and support for JAX is coming early next year. Customers are telling us that Neuron has made it easy for them to switch their existing model training and inference pipelines to Trainium and Inferentia with just a few lines of code.
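As a hedged illustration of what those few lines can look like for inference with torch-neuronx (using a public BERT checkpoint as a placeholder model; shapes and file names are assumptions), compiling a PyTorch model for Inferentia or Trainium devices is roughly:

import torch
import torch_neuronx
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# torchscript=True makes the model return plain tuples, which keeps it traceable.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", torchscript=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model.eval()

encoded = tokenizer("an example input", return_tensors="pt",
                    padding="max_length", max_length=128)
example_inputs = (encoded["input_ids"], encoded["attention_mask"])

# Compile once for Neuron devices; the result is a TorchScript module.
neuron_model = torch_neuronx.trace(model, example_inputs)
torch.jit.save(neuron_model, "model_neuron.pt")

# Load and run on an Inf2/Trn1 instance like any TorchScript model.
loaded = torch.jit.load("model_neuron.pt")
outputs = loaded(*example_inputs)  # first element holds the classification logits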

Nobody else offers this same combination of choice of the best ML chips, super-fast networking, virtualization, and hyper-scale clusters. And so, it’s not surprising that some of the most well-known generative AI startups like AI21 Labs, Anthropic, Hugging Face, Perplexity AI, Runway, and Stability AI run on AWS. But, you still need the right tools to effectively leverage this compute to build, train, and run LLMs and other FMs efficiently and cost-effectively. And for many of these startups, Amazon SageMaker is the answer. Whether building and training a new, proprietary model from scratch or starting with one of the many popular publicly available models, training is a complex and expensive undertaking. It’s also not easy to run these models cost-effectively. Customers must acquire large amounts of data and prepare it. This typically involves a lot of manual work cleaning data, removing duplicates, enriching and transforming it. Then they have to create and maintain large clusters of GPUs/accelerators, write code to efficiently distribute model training across clusters, frequently checkpoint, pause, inspect and optimize the model, and manually intervene and remediate hardware issues in the cluster. Many of these challenges aren’t new, they’re some of the reasons why we launched SageMaker six years ago—to break down the many barriers involved in model training and deployment and give developers a much easier way. Tens of thousands of customers use Amazon SageMaker, and an increasing number of them like LG AI Research, Perplexity AI, AI21, Hugging Face, and Stability AI are training LLMs and other FMs on SageMaker. Just recently, Technology Innovation Institute (creators of the popular Falcon LLMs) trained the largest publicly available model—Falcon 180B—on SageMaker. As model sizes and complexity have grown, so has SageMaker’s scope.

Over the years, we’ve added more than 380 game-changing features and capabilities to Amazon SageMaker like automatic model tuning, distributed training, flexible model deployment options, tools for MLOps, tools for data preparation, feature stores, notebooks, seamless integration with human-in-the-loop evaluations across the ML lifecycle, and built-in features for responsible AI. We keep innovating rapidly to make sure SageMaker customers are able to keep building, training, and running inference for all models—including LLMs and other FMs. And we’re making it even easier and more cost-effective for customers to train and deploy large models with two new capabilities. First, to simplify training we’re introducing Amazon SageMaker HyperPod, which automates more of the processes required for high-scale fault-tolerant distributed training (e.g., configuring distributed training libraries, scaling training workloads across thousands of accelerators, detecting and repairing faulty instances), speeding up training by as much as 40%. As a result, customers like Perplexity AI, Hugging Face, Stability, Hippocratic, Alkaid, and others are using SageMaker HyperPod to build, train, or evolve models. Second, we’re introducing new capabilities to make inference more cost-effective while reducing latency. SageMaker now helps customers deploy multiple models to the same instance so that they can share compute resources—reducing inference cost by 50% (on average). SageMaker also actively monitors instances that are processing inference requests and intelligently routes requests based on which instances are available—achieving 20% lower inference latency (on average). Conjecture, Salesforce, and Slack are already using SageMaker for hosting models due to these inference optimizations.

Middle layer of the stack: Amazon Bedrock adds new models and a wave of new capabilities make it even easier for customers to securely build and scale generative AI applications

While a number of customers will build their own LLMs and other FMs, or evolve any number of the publicly available options, many will not want to spend the resources and time to do this. For them, the middle layer of the stack offers these models as a service. Our solution here, Amazon Bedrock, allows customers to choose from industry-leading models from Anthropic, Stability AI, Meta, Cohere, AI21, and Amazon, customize them with their own data, and leverage all of the same leading security, access controls, and features they are used to in AWS—all through a managed service. We made Amazon Bedrock generally available in late September, and customer response has been overwhelmingly positive. Customers from around the world and across virtually every industry are excited to use Amazon Bedrock. adidas is enabling developers to get quick answers on everything from “getting started” info to deeper technical questions. Booking.com intends to use generative AI to write up tailored trip recommendations for every customer. Bridgewater Associates is developing an LLM-powered Investment Analyst Assistant to help generate charts, compute financial indicators, and summarize results. Carrier is making more precise energy analytics and insights accessible to customers so they reduce energy consumption and cut carbon emissions. Clariant is empowering its team members with an internal generative AI chatbot to accelerate R&D processes, support sales teams with meeting preparation, and automate customer emails. GoDaddy is helping customers easily set up their businesses online by using generative AI to build their websites, find suppliers, connect with customers, and more. Lexis Nexis Legal & Professional is transforming legal work for lawyers and increasing their productivity with Lexis+ AI conversational search, summarization, and document drafting and analysis capabilities. Nasdaq is helping to automate investigative workflows on suspicious transactions and strengthen their anti–financial crime and surveillance capabilities. All of these—and many more—diverse generative AI applications are running on AWS.

We are excited about the momentum for Amazon Bedrock, but it is still early days. What we’ve seen as we’ve worked with customers is that everyone is moving fast, but the evolution of generative AI continues at a rapid pace with new options and innovations happening practically daily. Customers are finding there are different models that work better for different use cases, or on different sets of data. Some models are great for summarization, others are great for reasoning and integration, and still others have really awesome language support. And then there is image generation, search use cases, and more—all coming from both proprietary models and from models that are publicly available to anyone. And in times when there is so much that is unknowable, the ability to adapt is arguably the most valuable tool of all. There is not going to be one model to rule them all. And certainly not just one technology company providing the models that everyone uses. Customers need to be trying out different models. They need to be able to switch between them or combine them within the same use case. This means they need a real choice of model providers (which the events of the past 10 days have made even more clear). This is why we invented Amazon Bedrock, why it resonates so deeply with customers, and why we are continuing to innovate and iterate quickly to make building with (and moving between) a range of models as easy as an API call (a short sketch of such a call follows this list), put the latest techniques for model customization in the hands of all developers, and keep customers secure and their data private. We’re excited to introduce several new capabilities that will make it even easier for customers to build and scale generative AI applications:

  • Expanding model choice with Anthropic Claude 2.1, Meta Llama 2 70B, and additions to the Amazon Titan family. In these early days, customers are still learning and experimenting with different models to determine which ones they want to use for various purposes. They want to be able to easily try the latest models, and also test to see which capabilities and features will give them the best results and cost characteristics for their use cases. With Amazon Bedrock, customers are only ever one API call away from a new model. Some of the most impressive results customers have experienced these last few months are from LLMs like Anthropic’s Claude model, which excels at a wide range of tasks from sophisticated dialog and content generation to complex reasoning while maintaining a high degree of reliability and predictability. Customers report that Claude is much less likely to produce harmful outputs, easier to converse with, and more steerable compared to other FMs, so developers can get their desired output with less effort. Anthropic’s state-of-the-art model, Claude 2, scores above the 90th percentile on the GRE reading and writing exams, and similarly on quantitative reasoning. And now, the newly released Claude 2.1 model is available in Amazon Bedrock. Claude 2.1 delivers key capabilities for enterprises such as an industry-leading 200K token context window (2x the context of Claude 2.0), reduced rates of hallucination, and significant improvements in accuracy, even at very long context lengths. Claude 2.1 also includes improved system prompts – which are model instructions that provide a better experience for end users – while also reducing the cost of prompts and completions by 25%.

    For a growing number of customers who want to use a managed version of Meta’s publicly available Llama 2 model, Amazon Bedrock offers Llama 2 13B, and we’re adding Llama 2 70B. Llama 2 70B is suitable for large-scale tasks such as language modeling, text generation, and dialogue systems. The publicly available Llama models have been downloaded more than 30M times, and customers love that Amazon Bedrock offers them as part of a managed service where they don’t need to worry about infrastructure or have deep ML expertise on their teams. Additionally, for image generation, Stability AI offers a suite of popular text-to-image models. Stable Diffusion XL 1.0 (SDXL 1.0) is the most advanced of these, and it is now generally available in Amazon Bedrock. The latest edition of this popular image model has increased accuracy, better photorealism, and higher resolution.

    Customers are also using Amazon Titan models, which are created and pretrained by AWS to offer powerful capabilities with great economics for a variety of use cases. Amazon has a 25-year track record in ML and AI—technology we use across our businesses—and we have learned a lot about building and deploying models. We have carefully chosen how we train our models and the data we use to do so. We indemnify customers against claims that our models or their outputs infringe on anyone’s copyright. We introduced our first Titan models in April of this year. Titan Text Lite—now generally available—is a succinct, cost-effective model for use cases like chatbots, text summarization, or copywriting, and it is also a compelling candidate for fine-tuning. Titan Text Express—also now generally available—is more expansive, and can be used for a wider range of text-based tasks, such as open-ended text generation and conversational chat. We offer these text model options to give customers the ability to optimize for accuracy, performance, and cost depending on their use case and business requirements. Customers like Nexxiot, PGA Tour, and Ryanair are using our two Titan Text models. We also have an embeddings model, Titan Text Embeddings, for search use cases and personalization. Customers like Nasdaq are seeing great results using Titan Text Embeddings to enhance capabilities for Nasdaq IR Insight to generate insights from 9,000+ global companies’ documents for sustainability, legal, and accounting teams. And we’ll continue to add more models to the Titan family over time. We are introducing a new embeddings model, Titan Multimodal Embeddings, to power multimodal search and recommendation experiences using images and text (or a combination of both) as inputs. And we are introducing a new text-to-image model, Amazon Titan Image Generator. With Titan Image Generator, customers across industries like advertising, e-commerce, and media and entertainment can use a text input to generate realistic, studio-quality images in large volumes and at low cost. We are excited about how customers are responding to Titan models, and you can expect that we’ll continue to innovate here.

  • New capabilities to customize your generative AI application securely with your proprietary data: One of the most important capabilities of Amazon Bedrock is how easy it is to customize a model. This becomes truly exciting for customers because it’s where generative AI meets their core differentiator—their data. However, it is really important that their data remains secure, that they have control of it along the way, and that model improvements are private to them. There are a few ways that you can do this, and Amazon Bedrock offers the broadest selection of customization options across multiple models. The first is fine-tuning. Fine-tuning a model in Amazon Bedrock is easy. You simply select the model and Amazon Bedrock makes a copy of it. Then you point to a few labeled examples (e.g., a series of good question-answer pairs) that you store in Amazon Simple Storage Service (Amazon S3), and Amazon Bedrock “incrementally trains” (augments the copied model with the new information) on these examples, and the result is a private, more accurate fine-tuned model that delivers more relevant, customized responses (a minimal API sketch appears after this list). We are excited to announce that fine-tuning is generally available for Cohere Command, Meta Llama 2, Amazon Titan Text (Lite and Express), Amazon Titan Multimodal Embeddings, and in preview for Amazon Titan Image Generator. And, through our collaboration with Anthropic, we will soon provide AWS customers early access to unique features for model customization and fine-tuning of its state-of-the-art model Claude.

    A second technique for customizing LLMs and other FMs for your business is retrieval augmented generation (RAG), which allows you to customize a model’s responses by augmenting your prompts with data from multiple sources, including document repositories, databases, and APIs. In September, we introduced a RAG capability, Knowledge Bases for Amazon Bedrock, that securely connects models to your proprietary data sources to supplement your prompts with more information so your applications deliver more relevant, contextual, and accurate responses. Knowledge Bases is now generally available with an API that performs the entire RAG workflow from fetching text needed to augment a prompt, to sending the prompt to the model, to returning the response. Knowledge Bases supports databases with vector capabilities that store numerical representations of your data (embeddings) that models use to access this data for RAG, including Amazon OpenSearch Service and other popular databases like Pinecone and Redis Enterprise Cloud (Amazon Aurora and MongoDB vector support coming soon). (The RAG call is shown in the sketch that follows this list.)

    The third way you can customize models in Amazon Bedrock is with continued pre-training. With this method, the model builds on its original pre-training for general language understanding to learn domain-specific language and terminology. This approach is for customers who have large troves of unlabeled, domain-specific information and want to enable their LLMs to understand the language, phrases, abbreviations, concepts, definitions, and jargon unique to their world (and business). Unlike fine-tuning, which uses a fairly small amount of data, continued pre-training is performed on large data sets (e.g., thousands of text documents). Continued pre-training is now available in Amazon Bedrock for Titan Text Lite and Titan Text Express.

  • General availability of Agents for Amazon Bedrock to help execute multistep tasks using systems, data sources, and company knowledge. LLMs are great at having conversations and generating content, but customers want their applications to be able to do even more—like take actions, solve problems, and interact with a range of systems to complete multi-step tasks like booking travel, filing insurance claims, or ordering replacement parts. And Amazon Bedrock can help with this challenge. With agents, developers select a model, write a few basic instructions like “you are a cheerful customer service agent” and “check product availability in the inventory system,” point the selected model to the right data sources and enterprise systems (e.g., CRM or ERP applications), and write a few AWS Lambda functions to execute the APIs (e.g., check availability of an item in the ERP inventory). Amazon Bedrock automatically analyzes the request and breaks it down into a logical sequence using the selected model’s reasoning capabilities to determine what information is needed, what APIs to call, and when to call them to complete a step or solve a task. Now generally available, agents can plan and perform most business tasks—from answering customer questions about your product availability to taking their orders—and developers don’t need to be familiar with machine learning, engineer prompts, train models, or manually connect systems. And Bedrock does all of this securely and privately, and customers like Druva and Athene are already using them to improve the accuracy and speed of development of their generative AI applications.
  • Introducing Guardrails for Amazon Bedrock so you can apply safeguards based on your use case requirements and responsible AI policies. Customers want to be sure that interactions with their AI applications are safe, avoid toxic or offensive language, stay relevant to their business, and align with their responsible AI policies. With guardrails, customers can specify topics to avoid, and Amazon Bedrock will only provide users with approved responses to questions that fall into those restricted categories. For example, an online banking application can be set up to avoid providing investment advice and to remove inappropriate content (such as hate speech and violence). In early 2024, customers will also be able to redact personally identifiable information (PII) in model responses. For example, after a customer interacts with a call center agent, the customer service conversation is often summarized for record keeping, and guardrails can remove PII from those summaries. Guardrails can be used across models in Amazon Bedrock (including fine-tuned models), and with Agents for Amazon Bedrock so customers can bring a consistent level of protection to all of their generative AI applications.
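To make the customization paths above concrete, the following is a minimal, hedged sketch using the AWS SDK for Python (boto3): one call submits a fine-tuning job against labeled examples stored in Amazon S3, and one call runs the end-to-end Knowledge Bases RAG workflow. All job names, role ARNs, S3 paths, model identifiers, hyperparameter values, and the knowledge base ID are illustrative placeholders rather than values from this post.

import boto3

# Fine-tuning: Amazon Bedrock copies the base model and incrementally trains it
# on labeled examples stored in Amazon S3 (identifiers below are placeholders).
bedrock = boto3.client("bedrock")
bedrock.create_model_customization_job(
    jobName="titan-lite-fine-tuning-job",
    customModelName="my-private-titan-lite",
    roleArn="arn:aws:iam::111122223333:role/BedrockCustomizationRole",
    baseModelIdentifier="amazon.titan-text-lite-v1",
    customizationType="FINE_TUNING",
    trainingDataConfig={"s3Uri": "s3://my-bucket/labeled-qa-pairs.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-bucket/customization-output/"},
    hyperParameters={"epochCount": "2"},
)

# RAG with Knowledge Bases: a single API call retrieves relevant text, augments
# the prompt, invokes the model, and returns a grounded response with citations.
agent_runtime = boto3.client("bedrock-agent-runtime")
response = agent_runtime.retrieve_and_generate(
    input={"text": "What is our refund policy for enterprise customers?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "EXAMPLEKBID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-v2:1",
        },
    },
)
print(response["output"]["text"])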

Top layer of the stack: Continued innovation makes generative AI accessible to more users

At the top layer of the stack are applications that leverage LLMs and other FMs so that you can take advantage of generative AI at work. One area where generative AI is already changing the game is in coding. Last year, we introduced Amazon CodeWhisperer, which helps you build applications faster and more securely by generating code suggestions and recommendations in near real-time. Customers like Accenture, Boeing, Bundesliga, The Cigna Group, Kone, and Warner Music Group are using CodeWhisperer to increase developer productivity—and Accenture is enabling up to 50,000 of their software developers and IT professionals with Amazon CodeWhisperer. We want as many developers as possible to be able to get the productivity benefits of generative AI, which is why CodeWhisperer offers recommendations for free to all individuals.

However, while AI coding tools do a lot to make developers’ lives easier, their productivity benefits are limited by their lack of knowledge of internal code bases, internal APIs, libraries, packages, and classes. One way to think about this is that if you hire a new developer, even if they’re world-class, they’re not going to be that productive at your company until they understand your best practices and code. Today’s AI-powered coding tools are like that new-hire developer. To help with this, we recently previewed a new customization capability in Amazon CodeWhisperer that securely leverages a customer’s internal code base to provide more relevant and useful code recommendations. With this capability, CodeWhisperer is an expert on your code and provides recommendations that are more relevant to save even more time. In a study we did with Persistent, a global digital engineering and enterprise modernization company, we found that customizations help developers complete tasks up to 28% faster than with CodeWhisperer’s general capabilities. Now a developer at a healthcare technology company can ask CodeWhisperer to “import MRI images associated with the customer ID and run them through the image classifier” to detect anomalies. Because CodeWhisperer has access to the code base, it can provide much more relevant suggestions that include the import locations of the MRI images and customer IDs. CodeWhisperer keeps customizations completely private, and the underlying FM does not use them for training, protecting customers’ valuable intellectual property. AWS is the only major cloud provider that offers a capability like this to everyone.

Introducing Amazon Q, the generative AI-powered assistant tailored for work

Developers certainly aren’t the only ones who are getting hands on with generative AI—millions of people are using generative AI chat applications. What early providers have done in this space is exciting and super useful for consumers, but in a lot of ways they don’t quite “work” at work. Their general knowledge and capabilities are great, but they don’t know your company, your data, your customers, your operations, or your business. That limits how much they can help you. They also don’t know much about your role—what work you do, who you work with, what information you use, and what you have access to. These limitations are understandable because these assistants don’t have access to your company’s private information, and they weren’t designed to meet the data privacy and security requirements companies need to give them this access. It’s hard to bolt on security after the fact and expect it to work well. We think we have a better way, which will allow every person in every organization to use generative AI safely in their day-to-day work.

We are excited to introduce Amazon Q, a new type of generative AI-powered assistant that is specifically for work and can be tailored to your business. Q can help you get fast, relevant answers to pressing questions, solve problems, generate content, and take actions using the data and expertise found in your company’s information repositories, code, and enterprise systems. When you chat with Amazon Q, it provides immediate, relevant information and advice to help streamline tasks, speed decision-making, and help spark creativity and innovation at work. We have built Amazon Q to be secure and private, and it can understand and respect your existing identities, roles, and permissions and use this information to personalize its interactions. If a user doesn’t have permission to access certain data without Q, they can’t access it using Q either. We have designed Amazon Q to meet stringent enterprise customers’ requirements from day one—none of their content is used to improve the underlying models.

Amazon Q is your expert assistant for building on AWS: We’ve trained Amazon Q on 17 years’ worth of AWS knowledge and experience so it can transform the way you build, deploy, and operate applications and workloads on AWS. Amazon Q has a chat interface in the AWS Management Console and documentation, your IDE (via CodeWhisperer), and your team chat rooms on Slack or other chat apps. Amazon Q can help you explore new AWS capabilities, get started faster, learn unfamiliar technologies, architect solutions, troubleshoot, upgrade, and much more —it’s an expert in AWS well-architected patterns, best practices, documentation, and solutions implementations. Here are some examples of what you can do with your new AWS expert assistant:

  • Get crisp answers and guidance on AWS capabilities, services, and solutions: Ask Amazon Q to “Tell me about Agents for Amazon Bedrock,” and Q will give you a description of the feature plus links to relevant materials. You can also ask Amazon Q virtually any question about how an AWS service works (e.g., “What are the scaling limits on a DynamoDB table?” “What is Redshift Managed Storage?”), or how to best architect any number of solutions (“What are the best practices for building event-driven architectures?”). And Amazon Q will pull together succinct answers and always cite (and link to) its sources.
  • Choose the best AWS service for your use case, and get started quickly: Ask Amazon Q “What are the ways to build a web app on AWS?” and it will provide a list of potential services like AWS Amplify, AWS Lambda, and Amazon EC2 with the advantages of each. From there you can narrow down the options by helping Q understand your requirements, preferences, and constraints (e.g., “Which of these would be best if I want to use containers?” or “Should I use a relational or non-relational database?”). Finish up with “How do I get started?” and Amazon Q will outline some basic steps and point you towards additional resources.
  • Optimize your compute resources: Amazon Q can help you select Amazon EC2 instances. If you ask it to “Help me find the right EC2 instance to deploy a video encoding workload for my gaming app with the highest performance”, Q will get you a list of instance families with reasons for each suggestion. And, you can ask any number of follow up questions to help find the best choice for your workload.
  • Get assistance debugging, testing, and optimizing your code: If you encounter an error while coding in your IDE, you can ask Amazon Q to help by saying, “My code has an IO error, can you provide a fix?” and Q will generate the code for you. If you like the suggestion, you can ask Amazon Q to add the fix to your application. Since Amazon Q is in your IDE, it understands the code you are working on and knows where to insert the fix. Amazon Q can also create unit tests (“Write unit tests for the selected function”) that it can insert into your code and you can run. Finally, Amazon Q can tell you ways to optimize your code for higher performance. Ask Q to “Optimize my selected DynamoDB query,” and it will use its understanding of your code to provide a natural language suggestion on what to fix along with the accompanying code you can implement in one click.
  • Diagnose and troubleshoot issues: If you encounter issues in the AWS Management Console, like EC2 permissions errors or Amazon S3 configuration errors, you can simply press the “Troubleshoot with Amazon Q” button, and it will use its understanding of the error type and service where the error is located to give you suggestions for a fix. You can even ask Amazon Q to troubleshoot your network (e.g., “Why can’t I connect to my EC2 instance using SSH?”) and Q will analyze your end-to-end configuration and provide a diagnosis (e.g., “This instance appears to be in a private subnet, so public accessibility may need to be established”).
  • Ramp up on a new code base in no time: When you chat with Amazon Q in your IDE, it combines its expertise in building software with an understanding of your code—a powerful pairing! Previously, if you took over a project from someone else, or you were new to the team, you might have to spend hours manually reviewing the code and documentation to understand how it works and what it does. Now, since Amazon Q understands the code in your IDE, you can simply ask Amazon Q to explain the code (“Provide me with a description of what this application does and how it works”) and Q will give you details like which services the code uses and what different functions do (e.g., Q might answer with something like, “This application is building a basic support ticket system using Python Flask and AWS Lambda” and go on to describe each of its core capabilities, how they are implemented, and much more).
  • Clear your feature backlog faster: You can even ask Amazon Q to guide you through and automate much of the end-to-end process of adding a feature to your application in Amazon CodeCatalyst, our unified software development service for teams. To do this, you just assign Q a backlog task from your issues list – just like you would a teammate – and Q generates a step-by-step plan for how it will build and implement the feature. Once you approve the plan, Q will write the code and present the suggested changes to you as a code review. You can request rework (if necessary), approve and/or deploy!
  • Upgrade your code in a fraction of the time: Most developers actually only spend a fraction of their time writing new code and building new applications. They spend a lot more of their cycles on painful, sloggy areas like maintenance and upgrades. Take language version upgrades. A large number of customers continue using older versions of Java because it will take months—even years—and thousands of hours of developer time to upgrade. Putting this off has real costs and risks—you miss out on performance improvements and are vulnerable to security issues. We think Amazon Q can be a game changer here, and are excited about Amazon Q Code Transformation, a feature which can remove a lot of this heavy lifting and reduce the time it takes to upgrade applications from days to minutes. You just open the code you want to update in your IDE, and ask Amazon Q to “/transform” your code. Amazon Q will analyze the entire source code of the application, generate the code in the target language and version, and execute tests, helping you realize the security and performance enhancements of the latest language versions. Recently, a very small team of Amazon developers used Amazon Q Code Transformation to upgrade 1,000 production applications from Java 8 to Java 17 in just two days. The average time per application was less than 10 minutes. Today Amazon Q Code Transformation performs Java language upgrades from Java 8 or Java 11 to Java 17. Coming next (and soon) is the ability to transform .NET Framework to cross-platform .NET (with even more transformations to follow in the future).

Amazon Q is your business expert: You can connect Amazon Q to your business data, information, and systems so that it can synthesize everything and provide tailored assistance to help people solve problems, generate content, and take actions that are relevant to your business. Bringing Amazon Q to your business is easy. It has 40+ built-in connectors to popular enterprise systems such as Amazon S3, Microsoft 365, Salesforce, ServiceNow, Slack, Atlassian, Gmail, Google Drive, and Zendesk. It can also connect to your internal intranet, wikis, and run books, and with the Amazon Q SDK, you can build a connection to whichever internal application you would like. Point Amazon Q at these repositories, and it will “ramp up” on your business, capturing and understanding the semantic information that makes your company unique. Then, you get your own friendly and simple Amazon Q web application so that employees across your company can interact with the conversational interface. Amazon Q also connects to your identity provider to understand a user, their role, and what systems they are permitted to access so that users can ask detailed, nuanced questions and get tailored results that include only information they are authorized to see. Amazon Q generates answers and insights that are accurate and faithful to the material and knowledge that you provide it, and you can restrict sensitive topics, block keywords, or filter out inappropriate questions and answers. Here are a few examples of what you can do with your business’s new expert assistant:

  • Get crisp, super-relevant answers based on your business data and information: Employees can ask Amazon Q about anything they might have previously had to search around for across all kinds of sources. Ask “What are the latest guidelines for logo usage?”, or “How do I apply for a company credit card?”, and Amazon Q will synthesize all of the relevant content it finds and come back with fast answers plus links to the relevant sources (e.g., brand portals and logo repositories, company T&E policies, and card applications).
  • Streamline day-to-day communications: Just ask, and Amazon Q can generate content (“Create a blog post and three social media headlines announcing the product described in this documentation”), create executive summaries (“Write a summary of our meeting transcript with a bulleted list of action items”), provide email updates (“Draft an email highlighting our Q3 training programs for customers in India”), and help structure meetings (“Create a meeting agenda to talk about the latest customer satisfaction report”).
  • Complete tasks: Amazon Q can help complete certain tasks, reducing the amount of time employees spend on repetitive work like filing tickets. Ask Amazon Q to “Summarize customer feedback on the new pricing offer in Slack,” and then request that Q take that information and open a ticket in Jira to update the marketing team. You can ask Q to “Summarize this call transcript,” and then “Open a new case for Customer A in Salesforce.” Amazon Q supports other popular work automation tools like Zendesk and ServiceNow.

Amazon Q is in Amazon QuickSight: With Amazon Q in QuickSight, AWS’s business intelligence service, users can ask their dashboards questions like “Why did the number of orders increase last month?” and get visualizations and explanations of the factors that influenced the increase. And, analysts can use Amazon Q to reduce the time it takes them to build dashboards from days to minutes with a simple prompt like “Show me sales by region by month as a stacked bar chart.” Q comes right back with that diagram, and you can easily add it to a dashboard or chat further with Q to refine the visualization (e.g., “Change the bar chart into a Sankey diagram,” or “Show countries instead of regions”). Amazon Q in QuickSight also makes it easier to use existing dashboards to inform business stakeholders, distill key insights, and simplify decision-making using data stories. For example, users may prompt Amazon Q to “Build a story about how the business has changed over the last month for a business review with senior leadership,” and in seconds, Amazon Q delivers a data-driven story that is visually compelling and is completely customizable. These stories can be shared securely throughout the organization to help align stakeholders and drive better decisions.

Amazon Q is in Amazon Connect: In Amazon Connect, our contact center service, Amazon Q helps your customer service agents provide better customer service. Amazon Q leverages the knowledge repositories your agents typically use to get information for customers, and then agents can chat with Amazon Q directly in Connect to get answers that help them respond more quickly to customer requests without needing to search through the documentation themselves. And, while chatting with Amazon Q for super-fast answers is great, in customer service there is no such thing as too fast. That’s why Amazon Q in Connect turns a live customer conversation with an agent into a prompt and automatically provides the agent with possible responses, suggested actions, and links to resources. For example, Amazon Q can detect that a customer is contacting a rental car company to change their reservation, generate a response for the agent to quickly communicate how the company’s change fee policies apply, and guide the agent through the steps they need to update the reservation.

Amazon Q is in AWS Supply Chain (Coming Soon): In AWS Supply Chain, our supply chain insights service, Amazon Q helps supply and demand planners, inventory managers, and trading partners optimize their supply chain by summarizing and highlighting potential stockout or overstock risks, and visualizing scenarios to solve the problem. Users can ask Amazon Q “what,” “why,” and “what if” questions about their supply chain data and chat through complex scenarios and the tradeoffs between different supply chain decisions. For example, a customer may ask, “What’s causing the delay in my shipments and how can I speed things up?” to which Amazon Q may reply, “90% of your orders are on the east coast, and a big storm in the Southeast is causing a 24-hour delay. If you ship to the port of New York instead of Miami, you’ll expedite deliveries and reduce costs by 50%.”

Our customers are adopting generative AI quickly—they are training groundbreaking models on AWS, they are developing generative AI applications at record speed using Amazon Bedrock, and they are deploying game-changing applications across their organizations like Amazon Q. With our latest announcements, AWS is bringing customers even more performance, choice, and innovation to every layer of the stack. The combined impact of all the capabilities we’re delivering at re:Invent marks a major milestone toward meeting an exciting and meaningful goal: We are making generative AI accessible to customers of all sizes and technical abilities so they can get to reinventing and transforming what is possible.


About the Author

Swami Sivasubramanian is Vice President of Data and Machine Learning at AWS. In this role, Swami oversees all AWS Database, Analytics, and AI & Machine Learning services. His team’s mission is to help organizations put their data to work with a complete, end-to-end data solution to store, access, analyze, visualize, and predict.

Read More

Package and deploy classical ML and LLMs easily with Amazon SageMaker, part 2: Interactive User Experiences in SageMaker Studio

Amazon SageMaker is a fully managed service that enables developers and data scientists to quickly and easily build, train, and deploy machine learning (ML) models at scale. SageMaker makes it easy to deploy models into production directly through API calls to the service. Models are packaged into containers for robust and scalable deployments.

SageMaker provides a variety of options to deploy models. These options vary in the amount of control you have and the work needed at your end. The AWS SDK gives you the most control and flexibility. It’s a low-level API available for Java, C++, Go, JavaScript, Node.js, PHP, Ruby, and Python. The SageMaker Python SDK is a high-level Python API that abstracts some of the steps and configuration, and makes it easier to deploy models. The AWS Command Line Interface (AWS CLI) is another high-level tool that you can use to interactively work with SageMaker to deploy models without writing your own code.
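To make the difference in abstraction concrete, the following is a minimal sketch of the low-level path using the AWS SDK for Python (boto3): create a model, an endpoint configuration, and an endpoint. The image URI, artifact location, role, and resource names are illustrative placeholders.

import boto3

sm = boto3.client("sagemaker")

# 1) Register the model: a serving container plus the trained model artifacts.
sm.create_model(
    ModelName="my-xgboost-model",
    PrimaryContainer={
        "Image": "<account>.dkr.ecr.<region>.amazonaws.com/my-inference-image:latest",
        "ModelDataUrl": "s3://my-bucket/model/model.tar.gz",
    },
    ExecutionRoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
)

# 2) Describe how the model should be hosted (instance type and count).
sm.create_endpoint_config(
    EndpointConfigName="my-endpoint-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-xgboost-model",
            "InstanceType": "ml.c5.xlarge",
            "InitialInstanceCount": 1,
        }
    ],
)

# 3) Create the endpoint that serves real-time inference requests.
sm.create_endpoint(EndpointName="my-endpoint", EndpointConfigName="my-endpoint-config")

The interactive experience described in the rest of this post, and the Python SDK improvements covered in Part 1, take care of most of this configuration for you.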

We are launching two new options that further simplify the process of packaging and deploying models using SageMaker. One way is for programmatic deployment. For that, we are offering improvements in the Python SDK. For more information, refer to Package and deploy classical ML and LLMs easily with Amazon SageMaker, part 1: PySDK Improvements. The second way is for interactive deployment. For that, we are launching a new interactive experience in Amazon SageMaker Studio. It will help you quickly deploy your own trained models or foundation models (FMs) from Amazon SageMaker JumpStart with optimized configuration, and achieve predictable performance at the lowest cost. Read on to check out what the new interactive experience looks like.

New interactive experience in SageMaker Studio

This post assumes that you have trained one or more ML models or are using FMs from the SageMaker JumpStart model hub and are ready to deploy them. Training a model using SageMaker is not a prerequisite for deploying a model using SageMaker. Some familiarity with SageMaker Studio is also assumed.

We walk you through how to do the following:

  • Create a SageMaker model
  • Deploy a SageMaker model
  • Deploy a SageMaker JumpStart large language model (LLM)
  • Deploy multiple models behind one endpoint
  • Test model inference
  • Troubleshoot errors

Create a SageMaker model

The first step in setting up a SageMaker endpoint for inference is to create a SageMaker model object. This model object is made up of two things: a container for the model, and the trained model that will be used for inference. The new interactive UI experience makes the SageMaker model creation process straightforward. If you’re new to SageMaker Studio, refer to the Developer Guide to get started.

  1. In the SageMaker Studio interface, choose Models in the navigation pane.
  2. On the Deployable models tab, choose Create.

Now all you need to do is provide the model container details, the location of your model data, and an AWS Identity and Access Management (IAM) role for SageMaker to assume on your behalf.

  1. For the model container, you can use one of the SageMaker pre-built Docker images that it provides for popular frameworks and libraries. If you choose to use this option, choose a container framework, a corresponding framework version, and a hardware type from the list of supported types.

Alternatively, you can specify a path to your own container stored in Amazon Elastic Container Registry (Amazon ECR).

  1. Next, upload your model artifacts. SageMaker Studio provides two ways to upload model artifacts:
    • First, you can specify a model.tar.gz either in an Amazon Simple Storage Service (Amazon S3) bucket or in your local path. This model.tar.gz must be structured in a format that is compliant with the container that you are using.
    • Alternatively, SageMaker Studio supports uploading raw artifacts for PyTorch and XGBoost models. For these two frameworks, provide the model artifacts in the format the container expects. For example, for PyTorch this would be a model.pth. Note that your model artifacts can also include an inference script for preprocessing and postprocessing. If you don’t provide an inference script, the default inference handlers for the container you have chosen will be used.
  2. After you select your container and artifact, specify an IAM role.
  3. Choose Create deployable model to create a SageMaker model.

The preceding steps demonstrate the simplest workflow. You can further customize the model creation process. For example, you can specify VPC details and enable network isolation to make sure that the container can’t make outbound calls on the public internet. You can expand the Advanced options section to see more options.

You can get guidance on the hardware for best price/performance ratio to deploy your endpoint by running a SageMaker Inference Recommender benchmarking job. To further customize the SageMaker model, you can pass in any tunable environment variables at the container level. Inference Recommender will also take a range of these variables to find the optimal configuration for your model serving and container.

After you create your model, you can see it on the Deployable models tab. If any issue was found during model creation, you will see the status in the Monitor status column. Choose the model’s name to view the details.

Deploy a SageMaker model

In the most basic scenario, all you need to do is select a deployable model from the Models page or an LLM from the SageMaker JumpStart page, select an instance type, set the initial instance count, and deploy the model. Let’s see what this process looks like in SageMaker Studio for your own SageMaker model. We discuss using LLMs later in this post.

  1. On the Models page, choose the Deployable models tab.
  2. Select the model to deploy and choose Deploy.
  3. The next step is to select an instance type that SageMaker will put behind the inference endpoint.

You want an instance that delivers the best performance at the lowest cost. SageMaker makes it straightforward for you to make this decision by showing recommendations. If you benchmarked your model using SageMaker Inference Recommender during the SageMaker model creation step, you will see the recommendations from that benchmark on the drop-down menu.

Otherwise, you will see a list of prospective instances on the menu. SageMaker uses its own heuristics to populate the list in that case.

  1. Specify the initial instance count, then choose Deploy.

SageMaker will create an endpoint configuration and deploy your model behind that endpoint. After the model is deployed, you will see the endpoint and model status as In service. Note that the endpoint may be ready before the model.

This is also the place in SageMaker Studio where you will manage the endpoint. You can navigate to the endpoint details page by choosing Endpoints under Deployments in the navigation pane. Use the Add model and Delete buttons to change the models behind the endpoint without needing to recreate an endpoint. The Test inference tab enables you to test your model by sending test requests to one of the in-service models directly from the SageMaker Studio interface. You can also edit the auto scaling policy on the Auto-scaling tab on this page. More details on adding, removing, and testing models are covered in the following sections. You can see the network, security, and compute information for this endpoint on the Settings tab.

Customize the deployment

The preceding example showed how straightforward it is to deploy a single model with minimum configuration required from your side. SageMaker populates most of the fields for you, but you can customize the configuration. For example, it automatically generates a name for the endpoint. However, you can name the endpoint according to your preference, or use an existing endpoint on the Endpoint name drop-down menu. For existing endpoints, you will see only the endpoints that are in service at that time. You can use the Advanced options section to specify an IAM role, VPC details, and tags.

Deploy a SageMaker JumpStart LLM

To deploy a SageMaker JumpStart LLM, complete the following steps:

  1. Navigate to the JumpStart page in SageMaker Studio.
  2. Choose one of the partner names to browse the list of models available from that partner, or use the search feature to get to the model page if you know the name of the model.
  3. Choose the model you want to deploy.
  4. Choose Deploy.

Note that use of LLMs is subject to the EULA and the terms and conditions of the provider.

  1. Accept the license and terms.
  2. Specify an instance type.

Many models from the JumpStart model hub come with a price-performance optimized default instance type for deployment. For models that don’t come with this default, you will be provided with a list of supported instance types on the Instance type drop-down menu. For benchmarked models, if you want to optimize the deployment specifically for either cost or performance to meet your specific use case, you can choose Alternate configurations to view more options that have been benchmarked with different combinations of total tokens, input length, and max concurrency. You can also select from other supported instances for that model.

  1. If using an alternate configuration, select your instance and choose Select.
  2. Choose Deploy to deploy the model.

You will see the endpoint and model status change to In service. You also have options to customize the deployment to meet your requirements in this case.

Deploy multiple models behind one endpoint

SageMaker enables you to deploy multiple models behind a single endpoint. This reduces hosting costs by improving endpoint utilization compared to using endpoints with only one model behind them. It also reduces deployment overhead because SageMaker manages loading models in memory and scaling them based on the traffic patterns to your endpoint. SageMaker Studio now makes it straightforward to do this.

  1. Get started by selecting the models that you want to deploy, then choose Deploy.
  2. Then you can create an endpoint with multiple models that have an allocated amount of compute that you define.

In this case, we use an ml.p4d.24xlarge instance for the endpoint and allocate the necessary number of resources for our two different models. Note that your endpoint is constrained to the instance types that are supported by this feature.

  1. If you start the flow from the Deployable models tab and want to add a SageMaker JumpStart LLM, or vice versa, you can make it an endpoint fronting multiple models by choosing Add model after starting the deployment workflow.
  2. Here, you can choose another FM from the SageMaker JumpStart model hub or a model using the Deployable Models option, which refers to models that you have saved as SageMaker model objects.
  3. Choose your model settings:
    • If the model uses a CPU instance, choose the number of CPUs and minimum number of copies for the model.
    • If the model uses a GPU instance, choose the number of accelerators and minimum number of copies for the model.
  4. Choose Add model.
  5. Choose Deploy to deploy these models to a SageMaker endpoint.

When the endpoint is up and ready (In service status), you’ll have two models deployed behind a single endpoint.
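Behind this workflow, each model you add is hosted as its own inference component with the compute you allocate. If you prefer to script the equivalent step, the following is a rough sketch with the AWS SDK for Python (boto3); the endpoint, model, and resource values are illustrative, and the exact fields are documented in the SageMaker API reference.

import boto3

sm = boto3.client("sagemaker")

# Add a model to an existing endpoint as an inference component, reserving a
# slice of the instance's accelerators and memory for it (values are examples).
sm.create_inference_component(
    InferenceComponentName="llama2-component",
    EndpointName="my-multi-model-endpoint",
    VariantName="AllTraffic",
    Specification={
        "ModelName": "my-llama2-model",  # an existing SageMaker model object
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},  # minimum number of copies of the model
)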

Test model inference

SageMaker Studio now makes it straightforward to test model inference requests. You can send the payload data directly using a supported content type, such as application/json or text/csv, or use Python SDK sample code to make an invocation request from your programming environment, like a notebook or local integrated development environment (IDE).

Note that the Python SDK example code option is available only for SageMaker JumpStart models, and it’s tailored for the specific model use case with input/output data transformation.
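If you prefer to test from code instead of the Studio UI, a few lines with the AWS SDK for Python (boto3) achieve the same thing. This is a minimal sketch; the endpoint name is a placeholder, and the payload format must match what your serving container expects.

import json

import boto3

runtime = boto3.client("sagemaker-runtime")

# Send a JSON payload to the endpoint and read back the prediction.
response = runtime.invoke_endpoint(
    EndpointName="my-endpoint",
    ContentType="application/json",
    Body=json.dumps({"inputs": "What is the capital of France?"}),
    # For endpoints hosting multiple models, you can also pass
    # InferenceComponentName to target a specific model.
)
print(response["Body"].read().decode("utf-8"))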

Troubleshoot errors

To help troubleshoot and look deeper into model deployment, there are tooltips on the resource Status label to show corresponding error and reason messages. There are also links to Amazon CloudWatch log groups on the endpoint details page. For single-model endpoints, the link to the CloudWatch container logs is conveniently located in the Summary section of the endpoint details. For endpoints with multiple models, the links to the CloudWatch logs are located on each row of the Models table view. The following are some common error scenarios for troubleshooting:

  • Model ping health check failure – The model deployment could fail because the serving container didn’t pass the model ping health check. To debug the issue, refer to the following container logs published by the CloudWatch log groups (a sketch for reading these logs programmatically follows this list):
    /aws/sagemaker/Endpoints/[EndpointName]
    /aws/sagemaker/InferenceComponents/[InferenceComponentName]

  • Inconsistent model and endpoint configuration caused deployment failures – If the deployment failed with one of the following error messages, it means the model selected for deployment used a different IAM role, VPC configuration, or network isolation configuration. The remediation is to update the model details to use the same IAM role, VPC configuration, and network isolation configuration during the deployment flow. If you’re adding a model to an existing endpoint, you could recreate the model object to match the target endpoint configurations.
    Model and endpoint config have different execution roles. Please ensure the execution roles are consistent.
    Model and endpoint config have different VPC configurations. Please ensure the VPC configurations are consistent.
    Model and endpoint config have different network isolation configurations. Please ensure the network isolation configurations are consistent.

  • Not enough capacity to deploy more models on the existing endpoint infrastructure – If the deployment failed with the following error message, it means the current endpoint infrastructure doesn’t have enough compute or memory hardware resources to deploy the model. The remediation is to increase the maximum instance count on the endpoint or delete any existing models deployed on the endpoint to make room for new model deployment.
    There is not enough hardware resources on the instances for this endpoint to create a copy of the inference component. Please update resource requirements for this inference component, remove existing inference components, or increase the number of instances for this endpoint.

  • Unsupported instance type for multiple model endpoint deployment – If the deployment failed with the following error message, it means the selected instance type is currently not supported for the multiple model endpoint deployment. The remediation is to change the instance type to an instance that supports this feature and retry the deployment.
    The instance type is not supported for multiple models endpoint. Please choose a different instance type.

For other model deployment issues, refer to Supported features.
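The CloudWatch log groups listed in the first troubleshooting item can also be read programmatically, which is convenient while iterating on a container. The following is a small sketch with the AWS SDK for Python (boto3); the endpoint name is a placeholder.

import time

import boto3

logs = boto3.client("logs")
endpoint_name = "my-endpoint"  # placeholder

# Read the last hour of container log events for the endpoint.
response = logs.filter_log_events(
    logGroupName=f"/aws/sagemaker/Endpoints/{endpoint_name}",
    startTime=int((time.time() - 3600) * 1000),  # milliseconds since epoch
)
for event in response["events"]:
    print(event["message"])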

Clean up

Cleanup is also straightforward. You can remove one or more models from your existing SageMaker endpoint by selecting the specific model on the SageMaker console. To delete the whole endpoint, navigate to the Endpoints page, select the desired endpoint, choose Delete, and accept the disclaimer to proceed with deletion.

Conclusion

The enhanced interactive experience in SageMaker Studio allows data scientists to focus on model building and bringing their artifacts to SageMaker while abstracting out the complexities of deployment. For those who prefer a code-based approach, check out the low-code equivalent with the ModelBuilder class.

To learn more, visit the SageMaker ModelBuilder Python interface documentation and the guided deploy workflows in SageMaker Studio. There is no additional charge for the SageMaker SDK and SageMaker Studio. You pay only for the underlying resources used. For more information on how to deploy models with SageMaker, see Deploy models for inference.

Special thanks to Sirisha Upadhyayala, Melanie Li, Dhawal Patel, Sam Edwards and Kumara Swami Borra.


About the authors

Raghu Ramesha is a Senior ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.

Deepak Garg is a Solutions Architect at AWS. He loves diving deep into AWS services and sharing his knowledge with customers. Deepak has a background in content delivery networks and telecommunications.

Ram Vegiraju is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Shiva Raaj Kotini works as a Principal Product Manager in the Amazon SageMaker Inference product portfolio. He focuses on model deployment, performance tuning, and optimization in SageMaker for inference.

Alwin (Qiyun) Zhao is a Senior Software Development Engineer with the Amazon SageMaker Inference Platform team. He is the lead developer of the deployment guardrails and shadow deployments, and he focuses on helping customers manage ML workloads and deployments at scale with high availability. He also works on platform architecture evolutions for fast and secure deployment of ML jobs and for running ML online experiments with ease. In his spare time, he enjoys reading, gaming, and traveling.

Gaurav Bhanderi is a Front End Engineer with the AI Platforms team in SageMaker. He works on delivering customer-facing UI solutions within the AWS organization. In his free time, he enjoys hiking and exploring local restaurants.

Read More

Package and deploy classical ML and LLMs easily with Amazon SageMaker, part 1: PySDK Improvements

Amazon SageMaker is a fully managed service that enables developers and data scientists to quickly and effortlessly build, train, and deploy machine learning (ML) models at any scale. SageMaker makes it straightforward to deploy models into production directly through API calls to the service. Models are packaged into containers for robust and scalable deployments. Although it provides various entry points like the SageMaker Python SDK, AWS SDKs, the SageMaker console, and Amazon SageMaker Studio notebooks to simplify the process of training and deploying ML models at scale, customers are still looking for better ways to deploy their models for playground testing and to optimize production deployments.

We are launching two new ways to simplify the process of packaging and deploying models using SageMaker.

In this post, we introduce the new SageMaker Python SDK ModelBuilder experience, which aims to minimize the learning curve for new SageMaker users like data scientists, while also helping experienced MLOps engineers maximize utilization of SageMaker hosting services. It reduces the complexity of initial setup and deployment by providing guidance on best practices for taking advantage of the full capabilities of SageMaker. We provide detailed information and GitHub examples for this new SageMaker capability.

The other new launch is to use the new interactive deployment experience in SageMaker Studio. We discuss this in Part 2.

Deploying models to a SageMaker endpoint entails a series of steps to get the model ready to be hosted on a SageMaker endpoint. This involves getting the model artifacts in the correct format and structure, creating inference code, and specifying essential details like the model image URL, Amazon Simple Storage Service (Amazon S3) location of model artifacts, serialization and deserialization steps, and necessary AWS Identity and Access Management (IAM) roles to facilitate appropriate access permissions. Following this, an endpoint configuration requires determining the inference type and configuring respective parameters such as instance types, counts, and traffic distribution among model variants.

To further help our customers when using SageMaker hosting, we introduced the new ModelBuilder class in the SageMaker Python SDK, which brings the following key benefits when deploying models to SageMaker endpoints:

  • Unifies the deployment experience across frameworks – The new experience provides a consistent workflow for deploying models built using different frameworks like PyTorch, TensorFlow, and XGBoost. This simplifies the deployment process.
  • Automates model deployment – Tasks like selecting appropriate containers, capturing dependencies, and handling serialization/deserialization are automated, reducing manual effort required for deployment.
  • Provides a smooth transition from local to SageMaker hosted endpoint – With minimal code changes, models can be easily transitioned from local testing to deployment on a SageMaker endpoint. Live logs make debugging seamless.

Overall, SageMaker ModelBuilder simplifies and streamlines the model packaging process for SageMaker inference by handling low-level details and providing tools for testing, validation, and optimization of endpoints. This improves developer productivity and reduces errors.

In the following sections, we deep dive into the details of this new feature. We also discuss how to deploy models to SageMaker hosting using ModelBuilder, which simplifies the process. Then we walk you through a few examples for different frameworks to deploy both traditional ML models and the foundation models that power generative AI use cases.

Getting to know SageMaker ModelBuilder

The new ModelBuilder is a Python class focused on taking ML models built using frameworks like XGBoost or PyTorch, and converting them into models that are ready for deployment on SageMaker. ModelBuilder provides a build() function, which generates the artifacts according to the model server, and a deploy() function to deploy locally or to a SageMaker endpoint. The introduction of this feature simplifies the integration of models with the SageMaker environment, optimizing them for performance and scalability. The following diagram shows how ModelBuilder works at a high level.

ModelBuilder class

The ModelBuilder class provides different options for customization. However, to deploy a framework model, the model builder just expects the model, input, output, and role:

class ModelBuilder(
    model,           # model id or model object
    role_arn,        # IAM role
    schema_builder,  # defines the input and output
    mode,            # select between local deployment and deployment to a SageMaker endpoint
    ...
)

SchemaBuilder

The SchemaBuilder class enables you to define the input and output for your endpoint. From these, it generates the corresponding marshaling functions for serializing and deserializing the input and output. The following class signature shows all the options for customization:

class SchemaBuilder(
    sample_input: Any,
    sample_output: Any,
    input_translator: CustomPayloadTranslator = None,
    output_translator: CustomPayloadTranslator = None
)

However, in most cases, just sample input and output would work. For example:

input = "How is the demo going?"
output = "Comment la démo va-t-elle?"
schema = SchemaBuilder(input, output)

By providing sample input and output, SchemaBuilder can automatically determine the necessary transformations, making the integration process more straightforward. For more advanced use cases, there’s flexibility to provide custom translation functions for both input and output, ensuring that more complex data structures can also be handled efficiently. We demonstrate this in the following sections by deploying different models with various frameworks using ModelBuilder.

Local mode experience

In this example, we use ModelBuilder to deploy an XGBoost model locally. You can use Mode to switch between local testing and deploying to a SageMaker endpoint. We first train the XGBoost model (locally or in SageMaker) and store the model artifacts in the working directory:

from xgboost import XGBClassifier

# Train the model (X_train, y_train, and model_dir come from your training setup)
model = XGBClassifier()
model.fit(X_train, y_train)
model.save_model(model_dir + "/my_model.xgb")

Then we create a ModelBuilder object by passing the actual model object and the SchemaBuilder, which uses the sample test input and output objects (the same input and output we used when training and testing the model) to infer the serialization needed. Note that we use Mode.LOCAL_CONTAINER to specify a local deployment. After that, we call the build function to automatically identify the supported framework container image and scan for dependencies. See the following code:

model_builder_local = ModelBuilder(
    model=model,  
    schema_builder=SchemaBuilder(X_test, y_pred), 
    role_arn=execution_role, 
    mode=Mode.LOCAL_CONTAINER
)
xgb_local_builder = model_builder_local.build()

Finally, we can call the deploy function on the model object, which also provides live logging for easier debugging. You don’t need to specify the instance type or count because the model will be deployed locally. If you provided these parameters, they will be ignored. This function returns the predictor object that we can use to make predictions with the test data:

# note: all the serialization and deserialization is handled by the model builder.
predictor_local = xgb_local_builder.deploy(
# instance_type='ml.c5.xlarge',
# initial_instance_count=1
)

# Make prediction for test data. 
predictor_local.predict(X_test)

Optionally, you can also control the loading of the model and preprocessing and postprocessing using InferenceSpec. We provide more details later in this post. Using LOCAL_CONTAINER is a great way to test out your script locally before deploying to a SageMaker endpoint.

Refer to the model-builder-xgboost.ipynb example to test out deploying both locally and to a SageMaker endpoint using ModelBuilder.

Deploy traditional models to SageMaker endpoints

In the following examples, we showcase how to use ModelBuilder to deploy traditional ML models.

XGBoost models

Similar to the previous section, you can deploy an XGBoost model to a SageMaker endpoint by changing the mode parameter when creating the ModelBuilder object:

model_builder = ModelBuilder(
    model=model,  
    schema_builder=SchemaBuilder(sample_input=sample_input, sample_output=sample_output), 
    role_arn=execution_role, 
    mode=Mode.SAGEMAKER_ENDPOINT
)
xgb_builder = model_builder.build()
predictor = xgb_builder.deploy(
    instance_type='ml.c5.xlarge',
    initial_instance_count=1
)

Note that when deploying to SageMaker endpoints, you need to specify the instance type and instance count when calling the deploy function.

Refer to the model-builder-xgboost.ipynb example to deploy an XGBoost model.

Triton models

You can use ModelBuilder to serve PyTorch models on Triton Inference Server. For that, you need to specify the model_server parameter as ModelServer.TRITON, pass a model, and have a SchemaBuilder object, which requires sample inputs and outputs from the model. ModelBuilder will take care of the rest for you.

model_builder = ModelBuilder(
    model=model,  
    schema_builder=SchemaBuilder(sample_input=sample_input, sample_output=sample_output), 
    role_arn=execution_role,
    model_server=ModelServer.TRITON, 
    mode=Mode.SAGEMAKER_ENDPOINT
)

triton_builder = model_builder.build()

predictor = triton_builder.deploy(
    instance_type='ml.g4dn.xlarge',
    initial_instance_count=1
)

Refer to model-builder-triton.ipynb to deploy a model with Triton.

Hugging Face models

In this example, we show you how to deploy a pre-trained transformer model provided by Hugging Face to SageMaker. We want to use the Hugging Face pipeline to load the model, so we create a custom inference spec for ModelBuilder:

# custom inference spec with hugging face pipeline
# InferenceSpec comes from the SageMaker Python SDK; pipeline from transformers
from transformers import pipeline

class MyInferenceSpec(InferenceSpec):
    def load(self, model_dir: str):
        # load the pre-trained English-to-French translation pipeline
        return pipeline("translation_en_to_fr", model="t5-small")

    def invoke(self, input, model):
        # run inference on the incoming payload
        return model(input)

inf_spec = MyInferenceSpec()

We also define the input and output of the inference workload by defining the SchemaBuilder object based on the model input and output:

# value is a sample English input and output its French translation (as in the earlier SchemaBuilder example)
schema = SchemaBuilder(value, output)

Then we create the ModelBuilder object and deploy the model onto a SageMaker endpoint following the same logic as shown in the other example:

builder = ModelBuilder(
    inference_spec=inf_spec,
    mode=Mode.SAGEMAKER_ENDPOINT,  # you can change it to Mode.LOCAL_CONTAINER for local testing
    schema_builder=schema,
    image_uri=image,
)
model = builder.build(
    role_arn=execution_role,
    sagemaker_session=sagemaker_session,
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.g5.2xlarge'
)

Refer to model-builder-huggingface.ipynb to deploy a Hugging Face pipeline model.

Deploy foundation models to SageMaker endpoints

In the following examples, we showcase how to use ModelBuilder to deploy foundation models. Just like the models mentioned earlier, all that is required is the model ID.

Hugging Face Hub

If you want to deploy a foundation model from Hugging Face Hub, all you need to do is pass the pre-trained model ID. For example, the following code snippet deploys the meta-llama/Llama-2-7b-hf model locally. You can change the mode to Mode.SAGEMAKER_ENDPOINT to deploy to SageMaker endpoints.

model_builder = ModelBuilder(
    model="meta-llama/Llama-2-7b-hf",
    schema_builder=SchemaBuilder(sample_input, sample_output),
    model_path="/home/ec2-user/SageMaker/LoadTestResources/meta-llama2-7b", #local path where artifacts will be saved
    mode=Mode.LOCAL_CONTAINER,
    env_vars={
        # Llama 2 is a gated model and requires a Hugging Face Hub token.
        "HUGGING_FACE_HUB_TOKEN": "<YourHuggingFaceToken>"
 
    }
)
model = model_builder.build()
local_predictor = model.deploy()

For gated models on Hugging Face Hub, you need to request access and pass the associated token as the environment variable HUGGING_FACE_HUB_TOKEN. Some Hugging Face models may also require trusting remote code, which can be enabled through the HF_TRUST_REMOTE_CODE environment variable. By default, ModelBuilder uses a Hugging Face Text Generation Inference (TGI) container as the underlying container for Hugging Face models. If you would like to use AWS Large Model Inference (LMI) containers instead, you can set the model_server parameter to ModelServer.DJL_SERVING when you configure the ModelBuilder object.
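
The following sketch combines these options; the token value is a placeholder, and switching to DJL Serving is shown only as an illustration of the model_server parameter described above:

model_builder = ModelBuilder(
    model="meta-llama/Llama-2-7b-hf",
    schema_builder=SchemaBuilder(sample_input, sample_output),
    role_arn=execution_role,
    mode=Mode.SAGEMAKER_ENDPOINT,
    model_server=ModelServer.DJL_SERVING,  # use an LMI container instead of the default TGI container
    env_vars={
        "HUGGING_FACE_HUB_TOKEN": "<YourHuggingFaceToken>",  # required for gated models
        "HF_TRUST_REMOTE_CODE": "True",  # only for models that require trusting remote code
    },
)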

A neat feature of ModelBuilder is the ability to run local tuning of the container parameters when you use LOCAL_CONTAINER mode. This feature can be used by simply running tuned_model = model.tune().
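
For example, the flow might look like the following (a sketch that assumes the builder was created in LOCAL_CONTAINER mode; deploying the tuned model afterward is shown only for illustration):

local_model = model_builder.build()
tuned_model = local_model.tune()  # benchmarks container parameters locally
local_predictor = tuned_model.deploy()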

Refer to demo-model-builder-huggingface-llama2.ipynb to deploy a Hugging Face Hub model.

SageMaker JumpStart

Amazon SageMaker JumpStart also offers a number of pre-trained foundation models. Just like the process of deploying a model from Hugging Face Hub, the model ID is required. Deploying a SageMaker JumpStart model to a SageMaker endpoint is as straightforward as running the following code:

model_builder = ModelBuilder(
    model="huggingface-llm-falcon-7b-bf16",
    schema_builder=SchemaBuilder(sample_input, sample_output),
    role_arn=execution_role
)

sm_ep_model = model_builder.build()

predictor = sm_ep_model.deploy()

For all available SageMaker JumpStart model IDs, refer to Built-in Algorithms with pre-trained Model Table. Refer to model-builder-jumpstart-falcon.ipynb to deploy a SageMaker JumpStart model.

Inference component

ModelBuilder allows you to use the new inference component capability in SageMaker to deploy models. For more information on inference components, see Reduce Model Deployment Costs By 50% on Average Using SageMaker’s Latest Features. You can use inference components for deployment with ModelBuilder by specifying endpoint_type=EndpointType.INFERENCE_COMPONENT_BASED in the deploy() method. You can also use the tune() method, which fetches the optimal number of accelerators, and modify it if required.

resource_requirements = ResourceRequirements(
    requests={
        "num_accelerators": 4,
        "memory": 1024,
        "copies": 1,
    },
    limits={},
)

predictor = model_2.deploy(
    mode=Mode.SAGEMAKER_ENDPOINT,
    endpoint_type=EndpointType.INFERENCE_COMPONENT_BASED,
    ...
)

Refer to model-builder-inference-component.ipynb to deploy a model as an inference component.

Customize the ModelBuilder class

The ModelBuilder class allows you to customize model loading using InferenceSpec.

In addition, you can control payload and response serialization and deserialization, and customize preprocessing and postprocessing, using CustomPayloadTranslator. Additionally, when you need to extend our pre-built containers for model deployment on SageMaker, you can use ModelBuilder to handle the model packaging process. In the following sections, we provide more details about these capabilities.

InferenceSpec

InferenceSpec offers an additional layer of customization. It allows you to define how the model is loaded and how it will handle incoming inference requests. Through InferenceSpec, you can define custom loading procedures for your models, bypassing the default loading mechanisms. This flexibility is particularly beneficial when working with non-standard models or custom inference pipelines. The invoke method can be customized, providing you with the ability to tailor how the model processes incoming requests (preprocessing and postprocessing). This customization can be essential to ensure that the inference process aligns with the specific needs of the model. See the following code:

class InferenceSpec(abc.ABC):
    @abc.abstractmethod
    def load(self, model_dir: str):
        pass

    @abc.abstractmethod
    def invoke(self, input_object: object, model: object):
        pass

The following code shows an example of using this class:

class MyInferenceSpec(InferenceSpec):
    def load(self, model_dir: str):
        ...  # load and return the model object from model_dir

    def invoke(self, input, model):
        return model(input)

CustomPayloadTranslator

When invoking SageMaker endpoints, the data is sent through HTTP payloads with different MIME types. For example, an image sent to the endpoint for inference needs to be converted to bytes at the client side and sent through the HTTP payload to the endpoint. When the endpoint receives the payload, it needs to deserialize the byte string back to the data type that is expected by the model (also known as server-side deserialization). After the model finishes prediction, the results need to be serialized to bytes that can be sent back through the HTTP payload to the user or client. When the client receives the response byte data, it needs to perform client-side deserialization to convert the bytes data back to the expected data format, such as JSON. At a minimum, you need to convert the data for the following (as numbered in the following diagram):

  1. Inference request serialization (handled by the client)
  2. Inference request deserialization (handled by the server or algorithm)
  3. Invoking the model against the payload
  4. Sending response payload back
  5. Inference response serialization (handled by the server or algorithm)
  6. Inference response deserialization (handled by the client)

The following diagram shows the process of serialization and deserialization during the invocation process.

In the following code snippet, we show an example of CustomPayloadTranslator when additional customization is needed to handle serialization and deserialization on the client and server sides, respectively:

from sagemaker.serve import CustomPayloadTranslator

# request translator
class MyRequestTranslator(CustomPayloadTranslator):
    # This function converts the payload to bytes - happens on the client side
    def serialize_payload_to_bytes(self, payload: object) -> bytes:
        # convert the input payload to bytes and return it
        ...

    # This function converts the bytes to payload - happens on the server side
    def deserialize_payload_from_stream(self, stream) -> object:
        # convert the byte stream to an in-memory object and return it
        ...

# response translator
class MyResponseTranslator(CustomPayloadTranslator):
    # This function converts the payload to bytes - happens on the server side
    def serialize_payload_to_bytes(self, payload: object) -> bytes:
        # convert the response payload to bytes and return it
        ...

    # This function converts the bytes to payload - happens on the client side
    def deserialize_payload_from_stream(self, stream) -> object:
        # convert the byte stream to an in-memory object and return it
        ...
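
You can then pass these translators to the SchemaBuilder object together with the sample input and output. The parameter names input_translator and output_translator below reflect our reading of the SDK at the time of writing; check the ModelBuilder documentation for the exact signature:

# Sketch: wiring the custom translators into the schema used by ModelBuilder
schema_builder = SchemaBuilder(
    sample_input=sample_input,
    sample_output=sample_output,
    input_translator=MyRequestTranslator(),
    output_translator=MyResponseTranslator(),
)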

In the demo-model-builder-pytorch.ipynb notebook, we demonstrate how to easily deploy a PyTorch model to a SageMaker endpoint using ModelBuilder with the CustomPayloadTranslator and the InferenceSpec class.

Stage model for deployment

If you want to stage the model for inference or register it in the model registry, you can use model.create() or model.register(). The model is created on the service, and you can deploy it later. See the following code:

model_builder = ModelBuilder(
    model=model,  
    schema_builder=SchemaBuilder(X_test, y_pred), 
    role_arn=execution_role, 
)
deployable_model = model_builder.build()

deployable_model.create() # deployable_model.register() for model registry

Use custom containers

SageMaker provides pre-built Docker images for its built-in algorithms and the supported deep learning frameworks used for training and inference. If a pre-built SageMaker container doesn’t fulfill all your requirements, you can extend the existing image to accommodate your needs. By extending a pre-built image, you can use the included deep learning libraries and settings without having to create an image from scratch. For more details about how to extend the pre-built containers, refer to the SageMaker documentation. ModelBuilder supports use cases where you bring your own containers that are extended from our pre-built Docker containers.

To use your own container image in this case, you need to set the fields image_uri and model_server when defining ModelBuilder:

model_builder = ModelBuilder(
    model=model,  # Pass in the actual model object. its "predict" method will be invoked in the endpoint.
    schema_builder=SchemaBuilder(X_test, y_pred), # Pass in a "SchemaBuilder" which will use the sample test input and output objects to infer the serialization needed.
    role_arn=execution_role, 
    image_uri=image_uri, # REQUIRED FOR BYOC: Passing in image hosted in personal ECR Repo
    model_server=ModelServer.TORCHSERVE, # REQUIRED FOR BYOC: Passing in model server of choice
    mode=Mode.SAGEMAKER_ENDPOINT,
    dependencies={"auto": True, "custom": ["protobuf==3.20.2"]}
)

Here, the image_uri will be the container image URI that is stored in your account’s Amazon Elastic Container Registry (Amazon ECR) repository. One example is shown as follows:

# Pulled the xgboost:1.7-1 DLC and pushed to personal ECR repo
image_uri = "<your_account_id>.dkr.ecr.us-west-2.amazonaws.com/my-byoc:xgb"

When image_uri is set, ModelBuilder skips automatic detection of the image during the build process because the image URI is provided. If model_server is not set in ModelBuilder, you will receive a validation error message, for example:

ValueError: Model_server must be set when image_uri is set. Supported model servers: {<ModelServer.TRITON: 5>, <ModelServer.DJL_SERVING: 4>, <ModelServer.TORCHSERVE: 1>}

As of the publication of this post, ModelBuilder supports bringing your own containers that are extended from our pre-built DLC container images or containers built with the model servers like Deep Java Library (DJL), Text Generation Inference (TGI), TorchServe, and Triton inference server.

Custom dependencies

When running ModelBuilder.build(), by default it automatically captures your Python environment into a requirements.txt file and installs the same dependencies in the container. However, sometimes your local Python environment will conflict with the environment in the container. ModelBuilder provides a simple way to modify the captured dependencies and fix such conflicts by allowing you to provide your custom configurations to ModelBuilder. Note that this is only for TorchServe and Triton with InferenceSpec. For example, you can specify the input parameter dependencies, which is a Python dictionary, in ModelBuilder as follows:

dependency_config = {
   "auto": True,
   "requirements": "/path/to/your/requirements.txt",
   "custom": ["module>=1.2.3,<1.5", "boto3==1.16.*", "some_module@http://some/url"]
}
  
ModelBuilder(
    # Other params
    dependencies=dependency_config,
).build()

We define the following fields:

  • auto – Whether to try to auto capture the dependencies in your environment.
  • requirements – A string of the path to your own requirements.txt file. (This is optional.)
  • custom – A list of any other custom dependencies that you want to add or modify. (This is optional.)

If the same module is specified in multiple places, custom will have highest priority, then requirements, and auto will have lowest priority. For example, let’s say that during autodetect, ModelBuilder detects numpy==1.25, and a requirements.txt file is provided that specifies numpy>=1.24,<1.26. Additionally, there is a custom dependency: custom = ["numpy==1.26.1"]. In this case, numpy==1.26.1 will be picked when we install dependencies in the container.
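
To make these priority rules concrete, the following configuration (illustrative values) would resolve to numpy==1.26.1 in the container:

dependencies = {
    "auto": True,                        # autodetect finds numpy==1.25 in the local environment
    "requirements": "requirements.txt",  # the file pins numpy>=1.24,<1.26
    "custom": ["numpy==1.26.1"],         # custom has the highest priority, so this version wins
}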

Clean up

When you’re done testing the models, as a best practice, delete the endpoint to save costs if it’s no longer required. You can follow the Clean up section in each of the demo notebooks or use the following code to delete the model and endpoint created by the demo:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

The new SageMaker ModelBuilder capability simplifies the process of deploying ML models into production on SageMaker. By handling many of the complex details behind the scenes, ModelBuilder reduces the learning curve for new users and maximizes utilization for experienced users. With just a few lines of code, you can deploy models with built-in frameworks like XGBoost, PyTorch, Triton, and Hugging Face, as well as models provided by SageMaker JumpStart into robust, scalable endpoints on SageMaker.

We encourage all SageMaker users to try out this new capability by referring to the ModelBuilder documentation page. ModelBuilder is available now to all SageMaker users at no additional charge. Take advantage of this simplified workflow to get your models deployed faster. We look forward to hearing how ModelBuilder accelerates your model development lifecycle!

Special thanks to Sirisha Upadhyayala, Raymond Liu, Gary Wang, Dhawal Patel, Deepak Garg and Ram Vegiraju.


About the authors

Melanie Li, PhD, is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers build solutions using state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing ML solutions with best practices. In her spare time, she loves to explore nature and spend time with family and friends.

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Sam Edwards is a Cloud Engineer (AI/ML) at AWS Sydney, specializing in machine learning and Amazon SageMaker. He is passionate about helping customers solve issues related to machine learning workflows and creating new solutions for them. Outside of work, he enjoys playing racquet sports and traveling.

Raghu Ramesha is a Senior ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.

Shiva Raaj Kotini works as a Principal Product Manager in the Amazon SageMaker inference product portfolio. He focuses on model deployment, performance tuning, and optimization in SageMaker for inference.

Mohan Gandhi is a Senior Software Engineer at AWS. He has been with AWS for the last 10 years and has worked on various AWS services like EMR, EFA and RDS. Currently, he is focused on improving the SageMaker Inference Experience. In his spare time, he enjoys hiking and marathons.


New – Code Editor, based on Code-OSS VS Code Open Source now available in Amazon SageMaker Studio

New – Code Editor, based on Code-OSS VS Code Open Source now available in Amazon SageMaker Studio

Today, we are excited to announce support for Code Editor, a new integrated development environment (IDE) option in Amazon SageMaker Studio. Code Editor is based on Code-OSS, Visual Studio Code Open Source, and provides access to the familiar environment and tools of the popular IDE that machine learning (ML) developers know and love, fully integrated with the broader SageMaker Studio feature set. Code Editor enables you to choose from thousands of VS Code compatible extensions available in the Open-VSX extension gallery to further enhance your teams’ development experience. You can also maximize your team’s productivity by using seamless integration to AWS services through the AWS Toolkit for Visual Studio Code, including the AWS AI-powered coding companion, Amazon CodeWhisperer.

As with all IDE applications in SageMaker Studio, ML developers and engineers can select the underlying compute on demand, and swap it based on their needs without losing data. Additionally, your teams can manage their codebase version control and collaborate across teams through native GitHub integration and reduce time-to-coding by using the most popular ML frameworks right out of the box with the pre-configured Amazon SageMaker Distribution container image.

Getting started with Code Editor on Amazon SageMaker Studio

Your IT administrator can set up a new SageMaker Studio domain or migrate an existing one to the new SageMaker Studio experience, which includes Code Editor. See Onboard to Amazon SageMaker Domain using Quick setup for more details. You can then launch Code Editor with a single click in your Amazon SageMaker Studio environment.

  1. After the domain is set up, launch SageMaker Studio’s new experience from the console or the pre-signed URL your administrator provided. You can find the Code Editor IDE in both the Applications section in the left-side panel and the Overview section, as shown in the following screenshot:
  2. On the Code Editor details page, choose Create Code Editor Space. Then enter a name for your space and choose Create space:
  3. On the Code Editor Space details page, choose your underlying configuration, including:
    1. The underlying Amazon Elastic Compute Cloud (Amazon EC2) instance type.
    2. An Amazon Elastic Block Store (Amazon EBS) volume size (this can range from 5 GB to 16 TB).
    3. The container image to use (you will have a SageMaker Distribution image for both CPU and GPU available at launch).
    4. A lifecycle configuration script to run in case you want to customize your environment at app creation.
    5. A shared Amazon Elastic File System (Amazon EFS) to mount in your Code Editor space (this needs to be configured by your administrator when provisioning your domain).
  4. After providing your space configuration details, choose Run space to provision your space resources.

If you have chosen a fast-launch instance with the default SageMaker Distribution as the image, your Code Editor space will be available in less than a minute. If you have added lifecycle configurations to the space, it might take additional time to install dependencies from that script.
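
If you prefer to script space creation instead of using the console, the following sketch with the AWS SDK for Python (Boto3) shows the general shape of the CreateSpace call; the domain ID, user profile name, and image ARN are placeholders, and the exact field names should be verified against the current SageMaker API reference:

import boto3

sm_client = boto3.client("sagemaker")

# Sketch: create a private Code Editor space owned by a user profile
sm_client.create_space(
    DomainId="<your-domain-id>",
    SpaceName="my-code-editor-space",
    OwnershipSettings={"OwnerUserProfileName": "<your-user-profile>"},
    SpaceSharingSettings={"SharingType": "Private"},
    SpaceSettings={
        "AppType": "CodeEditor",
        "CodeEditorAppSettings": {
            "DefaultResourceSpec": {
                "InstanceType": "ml.t3.medium",
                "SageMakerImageArn": "<sagemaker-distribution-image-arn>",
            }
        },
        "SpaceStorageSettings": {"EbsStorageSettings": {"EbsVolumeSizeInGb": 5}},
    },
)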

After your resources are provisioned, the space details page will show an Open Code Editor button.

  1. Choose Open Code Editor to launch the IDE.

The Code Editor IDE will launch in a new browser tab.

Code Editor features

Code Editor comes with a unique set of features to increase the productivity of your ML team:

  1. Fully managed infrastructure – The Code Editor IDE runs on fully managed infrastructure. Amazon SageMaker takes care of keeping the instances up to date with the latest security patches and upgrades.
  2. Dial resources up and down – With Code Editor, you can seamlessly change the underlying resources (e.g., instance type, EBS volume size) on which Code Editor is running. This is beneficial for developers who want to run workloads with changing compute, memory, and storage needs.
  3. SageMaker provided images – Code Editor is preconfigured with the SageMaker Distribution as the default image. This container image has all the most popular ML frameworks supported by SageMaker, along with the SageMaker Python SDK, boto3, and other AWS and data science-specific libraries installed. This significantly reduces the time you spend setting up your environment and decreases the complexity of managing package dependencies in your ML project.
  4. Amazon CodeWhisperer integration – Code Editor also comes with generative AI capabilities powered by Amazon CodeWhisperer. This native integration enables you to boost your productivity by generating code suggestions within the IDE.
  5. Integration with other AWS services – You get native integration with Amazon Simple Storage Service (Amazon S3) buckets, Amazon Elastic Container Registry (Amazon ECR) repositories, Amazon Redshift, Amazon CloudWatch, and more via the AWS Toolkit for VS Code, which simplifies development in the cloud.

Architecture details

When launching Code Editor in SageMaker Studio, you’re creating a new application that runs as a container in an EC2 instance of the type you selected when configuring your Code Editor space. SageMaker Studio handles the provisioning of underlying resources for you in a service managed account. The following diagram depicts a simplified version of the Code Editor IDE application architecture:

For a given user profile, you can launch multiple Code Editor spaces, with a variety of ML instance types (including accelerated computing instances). Each space defines the attached EBS volume size, the instance type and the type of application to run in the space (for example, Code Editor). When users run the space, the underlying EC2 instance is provisioned and a SageMaker Studio Code Editor app is instantiated, based on the selected container image. The EBS volume is persisted across Start/Stop cycles of the IDE app. If users stop the Code Editor app (for example, to save on compute costs), the compute resources are stopped but the EBS volume is preserved and re-attached to the instance at restart.

All Code Editor applications run in isolation; if you need to share data across applications, you can attach a shared Amazon Elastic File System (Amazon EFS) drive.

For your Code Editor IDE to use the pre-installed AWS Toolkit extension for VS Code and work with integrated AWS services such as Amazon CodeWhisperer, or data sources such as Amazon S3 and Amazon Redshift, you have to make sure that:

  • Your SageMaker Studio user profile’s execution role has appropriate permissions to use the services you want to work with.
  • You have a way to communicate to those services in case you have a VPC-only mode SageMaker Studio domain. For more details on the requirements to use AWS services in a VPC-only mode Studio domain, refer to Connect SageMaker Studio Notebooks in a VPC to External Resources.

Solution overview

In the following sections, we share how you can develop an example ML project with Code Editor on Amazon SageMaker Studio. We will deploy a Mistral 7B large language model (LLM) to an Amazon SageMaker real-time endpoint using a built-in Hugging Face container. In this example, Code Editor can be used by an ML engineering team who needs advanced IDE features to debug their code and deploy the endpoint. You can find the sample code in this GitHub repo. We show how you can structure your code for easy collaboration between team members, how you can use the AWS Toolkit for VS Code and Amazon CodeWhisperer to speed up your development, and how to deploy the Mistral 7B model on a SageMaker endpoint. Let’s walk through some of the common developer tasks in the IDE.

Interacting with AWS services directly from your IDE

Out of the box, Code Editor comes with the AWS Toolkit for Visual Studio Code to provide you with an integrated experience with other AWS services during your project. Based on your SageMaker Studio user profile’s AWS Identity and Access Management (IAM) permissions, you can interact with data in your Amazon S3 buckets, find container images in Amazon ECR, visualize Amazon CloudWatch logs for your SageMaker endpoint, and take advantage of other features to run your end-to-end ML project from your IDE.

Structure your code repository for effortless collaboration

You can structure your project repository to maximize the productivity of your team. For example, you can set up a single repository, aiming to strike a balance between common Python project conventions and your team’s collaboration needs.

Your code repository can contain a .vscode folder with all the necessary files for standardizing dependencies, extensions, and configurations across the different team members. Refer to the following animation for reference.

You can share dependencies across team members through a requirements.txt file. You can also specify a config.yaml file to share the launch primitives for your SageMaker endpoint. Your Code Editor session will share the same dependencies and configuration as your team members’, allowing you to quickly develop and debug your inference code and endpoint.
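
As an illustration, a deployment script in the repository could read those shared launch primitives from the file; the file name and keys below are hypothetical:

import yaml

# Hypothetical shared configuration checked into the repository
with open("config.yaml") as f:
    config = yaml.safe_load(f)

instance_type = config["endpoint"]["instance_type"]    # for example, ml.g5.2xlarge
instance_count = config["endpoint"]["instance_count"]  # for example, 1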

Develop and debug your code in the IDE

In the following example, we show how you can develop and debug the inference.py script that will be used in your SageMaker endpoint.

Generate code and test cases with Amazon CodeWhisperer

As part of the AWS Toolkit in your Code Editor, Amazon CodeWhisperer allows you to build faster and more securely with an AI coding companion. It can provide you with real-time code suggestions, is optimized for use with AWS services, and comes with built-in security scanning. In our example, we use Amazon CodeWhisperer to generate whole-line and full-function code to deploy and test your SageMaker endpoint.

Deploying your LLM into a SageMaker endpoint

You can deploy your model to a SageMaker endpoint from your IDE and monitor its status directly from SageMaker Studio.
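
A condensed sketch of such a deployment with the SageMaker Python SDK and the Hugging Face LLM (TGI) container might look like the following; the model ID, container version, instance type, and token limits are illustrative, and the complete example is in the GitHub repo mentioned earlier:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Hugging Face LLM (TGI) inference container provided by SageMaker
image_uri = get_huggingface_llm_image_uri("huggingface", version="1.1.0")

model = HuggingFaceModel(
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.1",  # illustrative model ID
        "SM_NUM_GPUS": "1",
        "MAX_INPUT_LENGTH": "2048",
        "MAX_TOTAL_TOKENS": "4096",
    },
    role=role,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=600,
)

print(predictor.predict({"inputs": "What is Amazon SageMaker?"}))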

As you scale your ML project into a production-ready application, Code Editor and the AWS Toolkit will allow you to manage and monitor your LLM application’s resources as you build, deploy, and run it.

Conclusion

Code Editor is available in all AWS Regions where Amazon SageMaker Studio is available (except GovCloud), and you only pay for the underlying compute and storage resources within SageMaker or other AWS services, based on your usage.

To get started with Code Editor on Amazon SageMaker Studio, you can use the AWS Free Tier, with 250 hours of ml.t3.medium instance on Amazon SageMaker Studio per month for the first 2 months. For more details, refer to Amazon SageMaker Pricing.


About the Authors

Eric Peña is a Senior Technical Product Manager in the AWS Artificial Intelligence Platforms team, working on Amazon SageMaker Interactive Machine Learning. He currently focuses on IDE integrations on SageMaker Studio. He holds an MBA degree from MIT Sloan and outside of work enjoys playing basketball and football.

Vikesh Pandey is a Machine Learning Specialist Solutions Architect at AWS, helping customers from financial industries design and build solutions on generative AI and ML. Outside of work, Vikesh enjoys trying out different cuisines and playing outdoor sports.

Bruno Pistone is an AI/ML Specialist Solutions Architect for AWS based in Milan. He works with large customers, helping them to deeply understand their technical needs and design AI and machine learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His expertise includes end-to-end machine learning, machine learning industrialization, and generative AI. He enjoys spending time with his friends and exploring new places, as well as travelling to new destinations.

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.

Sofian Hamiti is an AI/ML specialist Solutions Architect at AWS. He helps customers across industries accelerate their AI/ML journey by helping them build and operationalize end-to-end machine learning solutions.


Scale foundation model inference to hundreds of models with Amazon SageMaker – Part 1

Scale foundation model inference to hundreds of models with Amazon SageMaker – Part 1

As democratization of foundation models (FMs) becomes more prevalent and demand for AI-augmented services increases, software as a service (SaaS) providers are looking to use machine learning (ML) platforms that support multiple tenants—for data scientists internal to their organization and external customers. More and more companies are realizing the value of using FMs to generate highly personalized and effective content for their customers. Fine-tuning FMs on your own data can significantly boost model accuracy for your specific use case, whether it be sales email generation using page visit context, generating search answers tailored to a company’s services, or automating customer support by training on historical conversations.

Providing generative AI model hosting as a service enables any organization to easily integrate, pilot test, and deploy FMs at scale in a cost-effective manner, without needing in-house AI expertise. This allows companies to experiment with AI use cases like hyper-personalized sales and marketing content, intelligent search, and customized customer service workflows. By using hosted generative models fine-tuned on trusted customer data, businesses can deliver the next level of personalized and effective AI applications to better engage and serve their customers.

Amazon SageMaker offers different ML inference options, including real-time, asynchronous, and batch transform. This post focuses on providing prescriptive guidance on hosting FMs cost-effectively at scale. Specifically, we discuss the quick and responsive world of real-time inference, exploring different options for real-time inference for FMs.

For inference, multi-tenant AI/ML architectures need to consider the requirements for data and models, as well as the compute resources that are required to perform inference from these models. It’s important to consider how multi-tenant AI/ML models are deployed—ideally, in order to optimally utilize CPUs and GPUs, you have to be able to architect an inferencing solution that can enhance serving throughput and reduce cost by ensuring that models are distributed across the compute infrastructure in an efficient manner. In addition, customers are looking for solutions that help them deploy a best-practice inferencing architecture without needing to build everything from scratch.

SageMaker Inference is a fully managed ML hosting service. It supports building generative AI applications while meeting regulatory standards like FedRAMP. SageMaker enables cost-efficient scaling for high-throughput inference workloads. It supports diverse workloads including real-time, asynchronous, and batch inferences on hardware like AWS Inferentia, AWS Graviton, NVIDIA GPUs, and Intel CPUs. SageMaker gives you full control over optimizations, workload isolation, and containerization. It enables you to build generative AI as a service solution at scale with support for multi-model and multi-container deployments.

Challenges of hosting foundation models at scale

The following are some of the challenges in hosting FMs for inference at scale:

  • Large memory footprint – FMs with tens or hundreds of billions of model parameters often exceed the memory capacity of a single accelerator chip.
  • Transformers are slow – Autoregressive decoding in FMs, especially with long input and output sequences, exacerbates memory I/O operations. This culminates in unacceptable latency periods, adversely affecting real-time inference.
  • Cost – FMs necessitate ML accelerators that provide both high memory and high computational power. Achieving high throughput and low latency without sacrificing either is a specialized task, requiring a deep understanding of hardware-software acceleration co-optimization.
  • Longer time-to-market – Optimal performance from FMs demands rigorous tuning. This specialized tuning process, coupled with the complexities of infrastructure management, results in elongated time-to-market cycles.
  • Workload isolation – Hosting FMs at scale introduces challenges in minimizing the blast-radius and handling noisy neighbors. The ability to scale each FM in response to model-specific traffic patterns requires heavy lifting.
  • Scaling to hundreds of FMs – Operating hundreds of FMs simultaneously introduces substantial operational overhead. Effective endpoint management, appropriate slicing and accelerator allocation, and model-specific scaling are tasks that compound in complexity as more models are deployed.

Fitness functions

Deciding on the right hosting option is important because it impacts the experience your applications deliver to end users. For this purpose, we’re borrowing the concept of fitness functions, which was coined by Neal Ford and his colleagues from AWS Partner Thoughtworks in their work Building Evolutionary Architectures. Fitness functions provide a prescriptive assessment of various hosting options based on your objectives. Fitness functions help you obtain the necessary data to allow for the planned evolution of your architecture. They set measurable values to assess how close your solution is to achieving your set goals. Fitness functions can and should be adapted as the architecture evolves to guide a desired change process. This provides architects with a tool to guide their teams while maintaining team autonomy.

We propose considering the following fitness functions when it comes to selecting the right FM inference option at scale and cost-effectively:

  • Foundation model size – FMs are based on transformers. Transformers are slow and memory-hungry on generating long text sequences due to the sheer size of the models. Large language models (LLMs) are a type of FM that, when used to generate text sequences, need immense amounts of computing power and have difficulty accessing the available high bandwidth memory (HBM) and compute capacity. This is because a large portion of the available memory bandwidth is consumed by loading the model’s parameters and by the auto-regressive decoding process. As a result, even with massive amounts of compute power, FMs are limited by memory I/O and computation limits. Therefore, model size determines a lot of decisions, such as whether the model will fit on a single accelerator or require multiple ML accelerators using model sharding on the instance to run the inference at a higher throughput. Models with more than 3 billion parameters will generally start requiring multiple ML accelerators because the model might not fit into a single accelerator device.
  • Performance and FM inference latency – Many ML models and applications are latency critical, in which the inference latency must be within the bounds specified by a service-level objective. FM inference latency depends on a multitude of factors, including:
    • FM model size – Model size, including quantization at runtime.
    • Hardware – Compute (TFLOPS), HBM size and bandwidth, network bandwidth, intra-instance interconnect speed, and storage bandwidth.
    • Software environment – Model server, model parallel library, model optimization engine, collective communication performance, model network architecture, quantization, and ML framework.
    • Prompt – Input and output length and hyperparameters.
    • Scaling latency – Time to scale in response to traffic.
    • Cold start latency – Features like pre-warming the model load can reduce the cold start latency in loading the FM.
  • Workload isolation – This refers to workload isolation requirements from a regulatory and compliance perspective, including protecting confidentiality and integrity of AI models and algorithms, confidentiality of data during AI inference, and protecting AI intellectual property (IP) from unauthorized access or from a risk management perspective. For example, you can reduce the impact of a security event by purposefully reducing the blast-radius or by preventing noisy neighbors.
  • Cost-efficiency – Deploying and maintaining an FM and ML application on a scalable framework is a critical business process, and the costs may vary greatly depending on choices made about model hosting infrastructure, hosting option, ML frameworks, ML model characteristics, optimizations, scaling policy, and more. The workloads must utilize the hardware infrastructure optimally to ensure that the cost remains in check. This fitness function specifically refers to the infrastructure cost, which is part of the overall total cost of ownership (TCO). The infrastructure costs are the combined costs for storage, network, and compute. It’s also critical to understand other components of TCO, including operational costs and security and compliance costs. Operational costs are the combined costs of operating, monitoring, and maintaining the ML infrastructure. The operational costs are calculated as the number of engineers required based on each scenario and the annual salary of engineers, aggregated over a specific period. Hosting options that automatically scale to zero per model when there’s no traffic can also help keep costs down.
  • Scalability – This includes:
    • Operational overhead in managing hundreds of FMs for inference in a multi-tenant platform.
    • The ability to pack multiple FMs in a single endpoint and scale per model.
    • Enabling instance-level and model container-level scaling based on workload patterns.
    • Support for scaling to hundreds of FMs per endpoint.
    • Support for the initial placement of the models in the fleet and handling insufficient accelerators.

Representing the dimensions in fitness functions

We use a spider chart, also sometimes called a radar chart, to represent the dimensions in the fitness functions. A spider chart is often used when you want to display data across several unique dimensions. These dimensions are usually quantitative, and typically range from zero to a maximum value. Each dimension’s range is normalized to one another, so that when we draw our spider chart, the length of a line from zero to a dimension’s maximum value will be the same for every dimension.

The following chart illustrates the decision-making process involved when choosing your architecture on SageMaker. Each radius on the spider chart is one of the fitness functions that you will prioritize when you build your inference solution.

Ideally, you’d like a shape that is equilateral across all sides (a pentagon), which would show that you are able to optimize across all fitness functions. But the reality is that achieving that shape is challenging—as you prioritize one fitness function, it affects the lines for the other radii. This means there will always be trade-offs depending on what is most important for your generative AI application, and you’ll have a graph that is skewed towards a specific radius. These are the criteria that you may be willing to de-prioritize in favor of the others, depending on how you weigh each function. In our chart, each fitness function’s metric weight is defined as follows: the lower the value, the less optimal it is for that fitness function (with the exception of model size, where the higher the value, the larger the model).

For example, let’s take a use case where you would like to use a large summarization model (such as Anthropic Claude) to create work summaries of service cases and customer engagements based on case data and customer history. We have the following spider chart.

Because this may involve sensitive customer data, you’re choosing to isolate this workload from other models and host it on a single-model endpoint, which can make it challenging to scale because you have to spin up and manage separate endpoints for each FM. The generative AI application you’re using the model with is being used by service agents in real time, so latency and throughput are a priority, hence the need to use larger instance types, such as a P4De. In this situation, the cost may have to be higher because the priority is isolation, latency, and throughput.

Another use case would be a service organization building a Q&A chatbot application that is customized for a large number of customers. The following spider chart reflects their priorities.

Each chatbot experience may need to be tailored to each specific customer. The models being used may be relatively smaller (FLAN-T5-XXL, Llama 7B, and k-NN), and each chatbot operates at a designated set of hours for different time zones each day. The solution may also have Retrieval Augmented Generation (RAG) incorporated with a database containing all the knowledge base items to be used with inference in real time. There isn’t any customer-specific data being exchanged through this chatbot. Cold start latencies are tolerable because the chatbots operate on a defined schedule. For this use case, you can choose a multi-model endpoint architecture, and may be able to minimize cost by using smaller instance types (like a G5) and potentially reduce operational overhead by hosting multiple models on each endpoint at scale. With the exception of workload isolation, fitness functions in this use case may have more of an even priority, and trade-offs are minimized to an extent.

One final example would be an image generation application using a model like Stable Diffusion 2.0, which is a 3.5-billion-parameter model. Our spider chart is as follows.

This is a subscription-based application serving thousands of FMs and customers. The response time needs to be quick because each customer expects a fast turnaround of image outputs. Throughput is critical as well because there will be hundreds of thousands of requests at any given second, so the instance type will have to be a larger one, like a P4d instance that has enough GPUs and memory. For this, you can consider building a multi-container endpoint hosting multiple copies of the model to denoise image generation from one request set to another. For this use case, in order to prioritize latency and throughput and accommodate user demand, cost of compute and workload isolation will be the trade-offs.

Applying fitness functions to selecting the FM hosting option

In this section, we show you how to apply the preceding fitness functions in selecting the right FM hosting option on SageMaker FMs at scale.

SageMaker single-model endpoints

SageMaker single-model endpoints allow you to host one FM on a container hosted on dedicated instances for low latency and high throughput. These endpoints are fully managed and support auto scaling. You can configure the single-model endpoint as a provisioned endpoint where you pass in endpoint infrastructure configuration such as the instance type and count, where SageMaker automatically launches compute resources and scales them in and out depending on the auto scaling policy. You can scale to hosting hundreds of models using multiple single-model endpoints and employ a cell-based architecture for increased resiliency and reduced blast-radius.
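
As a simple illustration, a provisioned single-model endpoint can be created with the SageMaker Python SDK by passing the container image, the model artifacts, and the instance configuration; all values below are placeholders:

from sagemaker.model import Model

model = Model(
    image_uri="<inference-container-image-uri>",
    model_data="s3://<bucket>/<model-prefix>/model.tar.gz",
    role="<execution-role-arn>",
)

model.deploy(
    endpoint_name="fm-single-model-endpoint",
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # a multi-GPU instance for a model that needs several accelerators
)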

When evaluating fitness functions for a provisioned single-model endpoint, consider the following:

  • Foundation model size – This is suitable if you have models that can’t fit into a single ML accelerator’s memory and therefore need multiple accelerators in an instance.
  • Performance and FM inference latency – This is relevant for latency-critical generative AI applications.
  • Workload isolation – Your application may need Amazon Elastic Compute Cloud (Amazon EC2) instance-level isolation due to security compliance reasons. Each FM will get a separate inference endpoint and won’t share the EC2 instance with any other model. For example, you can isolate a HIPAA-related model inference workload (such as a PHI detection model) in a separate endpoint with a dedicated security group configuration with network isolation. You can isolate your GPU-based model inference workloads on Nitro-based EC2 instances, such as P4d, in order to separate them from less trusted workloads. The Nitro System-based EC2 instances provide a unique approach to virtualization and isolation, enabling you to secure and isolate sensitive data processing from AWS operators and software at all times. It provides the most important dimension of confidential computing as an intrinsic, on-by-default set of protections from the system software and cloud operators. This option also supports deploying AWS Marketplace models provided by third-party model providers on SageMaker.

SageMaker multi-model endpoints

SageMaker multi-model endpoints (MMEs) allow you to co-host multiple models on a GPU core, share GPU instances behind an endpoint across multiple models, and dynamically load and unload models based on the incoming traffic. With this, you can significantly save cost and achieve the best price-performance.

MMEs are the best choice if you need to host smaller models that can all fit into a single ML accelerator on an instance. This strategy should be considered if you have a large number (up to thousands) of similar sized (fewer than 1 billion parameters) models that you can serve through a shared container within an instance and don’t need to access all the models at the same time. You can load the model that needs to be used and then unload it for a different model.

MMEs are also designed for co-hosting models that use the same ML framework because they use the shared container to load multiple models. Therefore, if you have a mix of ML frameworks in your model fleet (such as PyTorch and TensorFlow), a SageMaker endpoint with InferenceComponents is a better choice. We discuss InferenceComponents more later in this post.

Finally, MMEs are suitable for applications that can tolerate an occasional cold start latency penalty because infrequently used models can be off-loaded in favor of frequently invoked models. If you have a long tail of infrequently accessed models, a multi-model endpoint can efficiently serve this traffic and enable significant cost savings.
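
A minimal sketch of creating an MME with the SageMaker Python SDK follows; it assumes the artifacts for all co-hosted models sit under a common Amazon S3 prefix and are served by a shared container image (all values are placeholders):

import boto3
from sagemaker.multidatamodel import MultiDataModel

mme = MultiDataModel(
    name="fm-multi-model-endpoint",
    model_data_prefix="s3://<bucket>/mme-models/",  # common prefix holding all model artifacts
    image_uri="<shared-inference-container-image-uri>",
    role="<execution-role-arn>",
)

mme.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name="fm-multi-model-endpoint",
)

# Invoke a specific model; it is loaded on demand the first time it is requested
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="fm-multi-model-endpoint",
    TargetModel="model-a.tar.gz",  # path relative to model_data_prefix
    ContentType="application/json",
    Body=b'{"inputs": "example payload"}',
)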

Consider the following when assessing when to use MMEs:

  • Foundation model size – You may have models that fit into a single ML accelerator’s HBM on an instance and therefore don’t need multiple accelerators.
  • Performance and FM inference latency – You may have generative AI applications that can tolerate cold start latency when the model is requested and is not in the memory.
  • Workload isolation – Consider having all the models share the same container.
  • Scalability – Consider the following:
    • You can pack multiple models in a single endpoint and scale per model and ML instance.
    • You can enable instance-level auto scaling based on workload patterns.
    • MMEs support scaling to thousands of models per endpoint. You don’t need to maintain per-model auto scaling and deployment configuration.
    • You can use hot deployment whenever the model is requested by the inference request.
    • You can load the models dynamically as per the inference request and unload in response to memory pressure.
    • You can time-share the underlying resources across the models.
  • Cost-efficiency – Consider time sharing the resource across the models by dynamic loading and unloading of the models, resulting in cost savings.

SageMaker inference endpoint with InferenceComponents

The new SageMaker inference endpoint with InferenceComponents provides a scalable approach to hosting multiple FMs in a single endpoint and scaling per model. It provides you with fine-grained control to allocate resources (accelerators, memory, CPU) and set auto scaling policies on a per-model basis to get assured throughput and predictable performance, and you can manage the utilization of compute across multiple models individually. If you have a lot of models of varying sizes and traffic patterns that you need to host, and the model sizes don’t allow them to fit in a single accelerator’s memory, this is the best option. It also allows you to scale to zero to save costs, but your application latency requirements need to be flexible enough to account for a cold start time for models. This option allows you the most flexibility in utilizing your compute as long as container-level isolation per customer or FM is sufficient. For more details on the new SageMaker endpoint with InferenceComponents, refer to the detailed post Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker.
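
As an illustration of this per-model resource control, the following sketch registers a model as an inference component on an existing endpoint using Boto3; the names and resource numbers are placeholders:

import boto3

sm_client = boto3.client("sagemaker")

sm_client.create_inference_component(
    InferenceComponentName="llama-7b-ic",
    EndpointName="fm-ic-endpoint",
    VariantName="AllTraffic",
    Specification={
        "ModelName": "<registered-model-name>",
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},  # number of copies of the model to run
)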

Consider the following when determining when you should use an endpoint with InferenceComponents:

  • Foundation model size – This is suitable for models that can’t fit into a single ML accelerator’s memory and therefore need multiple accelerators in an instance.
  • Performance and FM inference latency – This is suitable for latency-critical generative AI applications.
  • Workload isolation – You may have applications where container-level isolation is sufficient.
  • Scalability – Consider the following:
    • You can pack multiple FMs in a single endpoint and scale per model.
    • You can enable instance-level and model container-level scaling based on workload patterns.
    • This method supports scaling to hundreds of FMs per endpoint. You don’t need to configure the auto scaling policy for each model or container.
    • It supports the initial placement of the models in the fleet and handling insufficient accelerators.
  • Cost-efficiency – You can scale to zero per model when there is no traffic to save costs.

Packing multiple FMs on same endpoint: Model grouping

Determining what inference architecture strategy you employ on SageMaker depends on your application priorities and requirements. Some SaaS providers are selling into regulated environments that impose strict isolation requirements—they need an option that enables them to deploy some or all of their FMs on dedicated endpoints. But in order to optimize costs and gain economies of scale, SaaS providers also need multi-tenant environments where they host multiple FMs across a shared set of SageMaker resources. Most organizations will probably have a hybrid hosting environment where they have both single-model endpoints and multi-model or multi-container endpoints as part of their SageMaker architecture.

A critical exercise you will need to perform when architecting this distributed inference environment is to group your models for each type of architecture you’ll need to set up in your SageMaker endpoints. The first decision you’ll have to make is around workload isolation requirements—you will need to isolate the FMs that need to be in their own dedicated endpoints, whether it’s for security reasons, reducing the blast-radius and noisy neighbor risk, or meeting strict SLAs for latency.

Secondly, you’ll need to determine whether the FMs fit into a single ML accelerator or require multiple accelerators, what the model sizes are, and what their traffic patterns are. Similarly sized models that collectively serve to support a central function could logically be grouped together by co-hosting multiple models on an endpoint, because these would be part of a single business application that is managed by a central team. For co-hosting multiple models on the same endpoint, a grouping exercise needs to be performed to determine which models can sit in a single instance, a single container, or multiple containers.

Grouping the models for MMEs

MMEs are best suited for smaller models (fewer than 1 billion parameters) that can fit into a single accelerator and are similar in size and invocation latency. Some variation in model size is acceptable; for example, Zendesk’s models range from 10–50 MB, which works fine, but variations in size that are a factor of 10, 50, or 100 times greater aren’t suitable. Larger models may cause a higher number of loads and unloads of smaller models to accommodate sufficient memory space, which can result in added latency on the endpoint. Differences in performance characteristics of larger models could also consume resources like CPU unevenly, which could impact other models on the instance.

The models that are grouped together on the MME need to have staggered traffic patterns to allow you to share compute across the models for inference. Your access patterns and inference latency also need to allow for some cold start time as you switch between models.

The following are some of the recommended criteria for grouping the models for MMEs:

  • Smaller models – Use models with fewer than 1 billion parameters
  • Model size – Group similar sized models and co-host into the same endpoint
  • Invocation latency – Group models with similar invocation latency requirements that can tolerate cold starts
  • Hardware – Group the models using the same underlying EC2 instance type

Grouping the models for an endpoint with InferenceComponents

A SageMaker endpoint with InferenceComponents is best suited for hosting larger FMs (over 1 billion parameters) at scale that require multiple ML accelerators or devices in an EC2 instance. This option is suited for latency-sensitive workloads and applications where container-level isolation is sufficient. The following are some of the recommended criteria for grouping the models for an endpoint with multiple InferenceComponents:

  • Hardware – Group the models using the same underlying EC2 instance type
  • Model size – Grouping the model based on model size is recommended but not mandatory

Summary

In this post, we looked at three real-time ML inference options (single-model endpoints, multi-model endpoints, and endpoints with InferenceComponents) in SageMaker to host FMs at scale cost-effectively. You can use the five fitness functions to help you choose the right SageMaker hosting option for FMs at scale. Group the FMs and co-host them on SageMaker inference endpoints using the recommended grouping criteria. In addition to the fitness functions we discussed, you can use the following table to decide which shared SageMaker hosting option is best for your use case. You can find code samples for each of the FM hosting options on SageMaker in the following GitHub repos: single SageMaker endpoint, multi-model endpoint, and InferenceComponents endpoint.

Feature | Single-Model Endpoint | Multi-Model Endpoint | Endpoint with InferenceComponents
Model lifecycle | API for management | Dynamic through Amazon S3 path | API for management
Instance types supported | CPU, single and multi GPU, AWS Inferentia-based instances | CPU, single GPU-based instances | CPU, single and multi GPU, AWS Inferentia-based instances
Metric granularity | Endpoint | Endpoint | Endpoint and container
Scaling granularity | ML instance | ML instance | Container
Scaling behavior | Independent ML instance scaling | Models are loaded and unloaded from memory | Independent container scaling
Model pinning | . | Models can be unloaded based on memory | Each container can be configured to be always loaded or unloaded
Container requirements | SageMaker pre-built, SageMaker-compatible Bring Your Own Container (BYOC) | MMS, Triton, BYOC with MME contracts | SageMaker pre-built, SageMaker-compatible BYOC
Routing options | Random or least connection | Random, sticky with popularity window | Random or least connection
Hardware allocation for model | Dedicated to single model | Shared | Dedicated for each container
Number of models supported | Single | Thousands | Hundreds
Response streaming | Supported | Not supported | Supported
Data capture | Supported | Not supported | Not supported
Shadow testing | Supported | Not supported | Not supported
Multi-variants | Supported | Not applicable | Not supported
AWS Marketplace models | Supported | Not applicable | Not supported

About the authors

Mehran Najafi, PhD, is a Senior Solutions Architect for AWS focused on AI/ML and SaaS solutions at Scale.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including the NLP and computer vision domains. He helps customers achieve high-performance model inference on SageMaker.

Rielah DeJesus is a Principal Solutions Architect at AWS who has successfully helped various enterprise customers in the DC, Maryland, and Virginia area move to the cloud. A customer advocate and technical advisor, she helps organizations like Heroku/Salesforce achieve success on the AWS platform. She is a staunch supporter of Women in IT and very passionate about finding ways to creatively use technology and data to solve everyday challenges.
