An Amazon Scholar received the Committee of Presidents of Statistical Societies Presidents’ Award for his achievements in statistics.
Promote pipelines in a multi-environment setup using Amazon SageMaker Model Registry, HashiCorp Terraform, GitHub, and Jenkins CI/CD
In the rapidly evolving landscape of artificial intelligence (AI) and machine learning (ML), building out a machine learning operations (MLOps) platform is essential for organizations to seamlessly bridge the gap between data science experimentation and deployment while meeting requirements around model performance, security, and compliance.
To fulfill regulatory and compliance requirements, the key considerations when designing such a platform are:
- Address data drift
- Monitor model performance
- Facilitate automatic model retraining
- Provide a process for model approval
- Keep models in a secure environment
In this post, we show how to create an MLOps framework to address these needs while using a combination of AWS services and third-party toolsets. The solution entails a multi-environment setup with automated model retraining, batch inference, and monitoring with Amazon SageMaker Model Monitor, model versioning with SageMaker Model Registry, and a CI/CD pipeline to facilitate promotion of ML code and pipelines across environments by using Amazon SageMaker, Amazon EventBridge, Amazon Simple Notification Service (Amazon SNS), HashiCorp Terraform, GitHub, and Jenkins CI/CD. We build a model to predict the severity (benign or malignant) of a mammographic mass lesion trained with the XGBoost algorithm using the publicly available UCI Mammography Mass dataset and deploy it using the MLOps framework. The full instructions with code are available in the GitHub repository.
Solution overview
The following architecture diagram shows an overview of the MLOps framework with the following key components:
- Multi-account strategy – Two different environments (dev and prod) are set up in two different AWS accounts following AWS Well-Architected best practices, and a third account is used for the central model registry:
- Dev environment – Where an Amazon SageMaker Studio domain is set up to allow model development, model training, and testing of ML pipelines (train and inference), before a model is ready to be promoted to higher environments.
- Prod environment – Where the ML pipelines from dev are promoted as a first step, and then scheduled and monitored over time.
- Central model registry – Amazon SageMaker Model Registry is set up in a separate AWS account to track model versions generated across the dev and prod environments.
- CI/CD and source control – The deployment of ML pipelines across environments is handled through CI/CD set up with Jenkins, along with version control handled through GitHub. Code changes merged to the corresponding environment Git branch trigger a CI/CD workflow to make appropriate changes to the given target environment.
- Batch predictions with model monitoring – The inference pipeline built with Amazon SageMaker Pipelines runs on a scheduled basis to generate predictions along with model monitoring using SageMaker Model Monitor to detect data drift.
- Automated retraining mechanism – The training pipeline built with SageMaker Pipelines is triggered whenever a data drift is detected in the inference pipeline. After it’s trained, the model is registered into the central model registry to be approved by a model approver. When it’s approved, the updated model version is used to generate predictions through the inference pipeline.
- Infrastructure as code – The infrastructure as code (IaC), created using HashiCorp Terraform, supports the scheduling of the inference pipeline with EventBridge, triggering of the train pipeline based on an EventBridge rule and sending notifications using Amazon Simple Notification Service (Amazon SNS) topics.
The MLOps workflow includes the following steps:
- Access the SageMaker Studio domain in the development account, clone the GitHub repository, go through the process of model development using the sample model provided, and generate the train and inference pipelines.
- Run the train pipeline in the development account, which generates the model artifacts for the trained model version and registers the model into SageMaker Model Registry in the central model registry account.
- Approve the model in SageMaker Model Registry in the central model registry account.
- Push the code (train and inference pipelines, and the Terraform IaC code to create the EventBridge schedule, EventBridge rule, and SNS topic) into a feature branch of the GitHub repository. Create a pull request to merge the code into the main branch of the GitHub repository.
- Trigger the Jenkins CI/CD pipeline, which is set up with the GitHub repository. The CI/CD pipeline deploys the code into the prod account to create the train and inference pipelines along with Terraform code to provision the EventBridge schedule, EventBridge rule, and SNS topic.
- The inference pipeline is scheduled to run on a daily basis, whereas the train pipeline is set up to run whenever data drift is detected from the inference pipeline.
- Notifications are sent through the SNS topic whenever there is a failure with either the train or inference pipeline.
Prerequisites
For this solution, you should have the following prerequisites:
- Three AWS accounts (dev, prod, and central model registry accounts)
- A SageMaker Studio domain set up in each of the three AWS accounts (see Onboard to Amazon SageMaker Studio or watch the video Onboard Quickly to Amazon SageMaker Studio for setup instructions)
- Jenkins (we use Jenkins 2.401.1) with administrative privileges installed on AWS
- Terraform version 1.5.5 or later installed on Jenkins server
For this post, we work in the us-east-1 Region to deploy the solution.
Provision KMS keys in dev and prod accounts
Our first step is to create AWS Key Management Service (AWS KMS) keys in the dev and prod accounts.
Create a KMS key in the dev account and give access to the prod account
Complete the following steps to create a KMS key in the dev account:
- On the AWS KMS console, choose Customer managed keys in the navigation pane.
- Choose Create key.
- For Key type, select Symmetric.
- For Key usage, select Encrypt and decrypt.
- Choose Next.
- Enter the production account number to give the production account access to the KMS key provisioned in the dev account. This is a required step because the first time the model is trained in the dev account, the model artifacts are encrypted with the KMS key before being written to the S3 bucket in the central model registry account. The production account needs access to the KMS key in order to decrypt the model artifacts and run the inference pipeline.
- Choose Next and finish creating your key.
After the key is provisioned, it should be visible on the AWS KMS console.
Create a KMS key in the prod account
Go through the same steps in the previous section to create a customer managed KMS key in the prod account. You can skip the step to share the KMS key to another account.
Set up a model artifacts S3 bucket in the central model registry account
Create an S3 bucket of your choice, with the string sagemaker as part of the bucket name, in the central model registry account, and update the bucket policy on the S3 bucket to give the dev and prod accounts permissions to read and write model artifacts to the bucket.
The following code is the bucket policy to be updated on the S3 bucket:
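The original policy document is not reproduced here; the following is a minimal sketch of how such a cross-account bucket policy could be applied with Boto3. The bucket name and account IDs are hypothetical placeholders, and the exact actions and principals should be tightened to match your security requirements.

```python
import json
import boto3

s3 = boto3.client("s3")

# Hypothetical values for illustration only
bucket_name = "sagemaker-central-model-artifacts-<unique-suffix>"
dev_account_id = "111111111111"
prod_account_id = "222222222222"

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCrossAccountModelArtifactAccess",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    f"arn:aws:iam::{dev_account_id}:root",
                    f"arn:aws:iam::{prod_account_id}:root",
                ]
            },
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*",
            ],
        }
    ],
}

# Attach the policy to the central model artifacts bucket
s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(bucket_policy))
```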
Set up IAM roles in your AWS accounts
The next step is to set up AWS Identity and Access Management (IAM) roles in your AWS accounts with permissions for AWS Lambda, SageMaker, and Jenkins.
Lambda execution role
Set up Lambda execution roles in the dev and prod accounts, which will be used by the Lambda function run as part of the SageMaker Pipelines Lambda step. This step runs from the inference pipeline to fetch the latest approved model, which is then used to generate inferences. Create IAM roles in the dev and prod accounts with the naming convention arn:aws:iam::<account-id>:role/lambda-sagemaker-role and attach the following IAM policies:
- Policy 1 – Create an inline policy named cross-account-model-registry-access, which gives access to the model package set up in the model registry in the central account.
- Policy 2 – Attach AmazonSageMakerFullAccess, which is an AWS managed policy that grants full access to SageMaker. It also provides select access to related services, such as AWS Application Auto Scaling, Amazon S3, Amazon Elastic Container Registry (Amazon ECR), and Amazon CloudWatch Logs.
- Policy 3 – Attach AWSLambda_FullAccess, which is an AWS managed policy that grants full access to Lambda, Lambda console features, and other related AWS services.
- Policy 4 – Use the following IAM trust policy for the IAM role:
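The trust policy document itself is not reproduced above; the following is a minimal sketch of what it might look like, applied with Boto3. The trusted services (Lambda, and optionally SageMaker) are assumptions based on how the role is used in this post; adjust them to your setup.

```python
import json
import boto3

iam = boto3.client("iam")

# Minimal trust policy sketch: allow Lambda (and optionally SageMaker) to assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": ["lambda.amazonaws.com", "sagemaker.amazonaws.com"]},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Create the role with the trust policy attached
iam.create_role(
    RoleName="lambda-sagemaker-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
```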
SageMaker execution role
The SageMaker Studio domains set up in the dev and prod accounts should each have an execution role associated, which can be found on the Domain settings tab on the domain details page, as shown in the following screenshot. This role is used to run training jobs, processing jobs, and more within the SageMaker Studio domain.
Add the following policies to the SageMaker execution role in both accounts:
- Policy 1 – Create an inline policy named cross-account-model-artifacts-s3-bucket-access, which gives access to the S3 bucket in the central model registry account that stores the model artifacts.
- Policy 2 – Create an inline policy named cross-account-model-registry-access, which gives access to the model package in the model registry in the central model registry account.
- Policy 3 – Create an inline policy named kms-key-access-policy, which gives access to the KMS key created in the previous step. Provide the account ID in which the policy is being created and the KMS key ID created in that account.
- Policy 4 – Attach AmazonSageMakerFullAccess, which is an AWS managed policy that grants full access to SageMaker and select access to related services.
- Policy 5 – Attach AWSLambda_FullAccess, which is an AWS managed policy that grants full access to Lambda, Lambda console features, and other related AWS services.
- Policy 6 – Attach CloudWatchEventsFullAccess, which is an AWS managed policy that grants full access to CloudWatch Events.
- Policy 7 – Add the following IAM trust policy for the SageMaker execution IAM role (a minimal sketch is shown after this list).
- Policy 8 (specific to the SageMaker execution role in the prod account) – Create an inline policy named cross-account-kms-key-access-policy, which gives access to the KMS key created in the dev account. This is required because the model artifacts stored in the central model registry account are encrypted with the KMS key from the dev account when the first version of the model is created in the dev account, and the inference pipeline in the prod account must be able to read them.
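For Policy 7 above, a minimal sketch of the trust policy, applied with Boto3, might look like the following. The role name is a hypothetical placeholder; use the execution role associated with your SageMaker Studio domain.

```python
import json
import boto3

iam = boto3.client("iam")

# Minimal trust policy sketch: allow SageMaker to assume the execution role
sagemaker_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Update the trust relationship on the existing SageMaker execution role
iam.update_assume_role_policy(
    RoleName="<your-sagemaker-execution-role-name>",  # hypothetical placeholder
    PolicyDocument=json.dumps(sagemaker_trust_policy),
)
```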
Cross-account Jenkins role
Set up an IAM role called cross-account-jenkins-role in the prod account, which Jenkins will assume to deploy ML pipelines and corresponding infrastructure into the prod account.
Add the following managed IAM policies to the role:
- CloudWatchFullAccess
- AmazonS3FullAccess
- AmazonSNSFullAccess
- AmazonSageMakerFullAccess
- AmazonEventBridgeFullAccess
- AWSLambda_FullAccess
Update the trust relationship on the role to give permissions to the AWS account hosting the Jenkins server:
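The trust policy document is not reproduced here; a minimal sketch, applied with Boto3, follows. The Jenkins account ID is a hypothetical placeholder.

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical account ID of the AWS account hosting the Jenkins server
jenkins_account_id = "333333333333"

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{jenkins_account_id}:root"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Allow the Jenkins account to assume the cross-account role in prod
iam.update_assume_role_policy(
    RoleName="cross-account-jenkins-role",
    PolicyDocument=json.dumps(trust_policy),
)
```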
Update permissions on the IAM role associated with the Jenkins server
Assuming that Jenkins has been set up on AWS, update the IAM role associated with Jenkins to add the following policies, which will give Jenkins access to deploy the resources into the prod account:
- Policy 1 – Create the following inline policy named assume-production-role-policy (a minimal sketch is shown after this list).
- Policy 2 – Attach the CloudWatchFullAccess managed IAM policy.
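The original inline policy for Policy 1 is not reproduced here; the following is a minimal sketch, applied with Boto3. The prod account ID and the name of the IAM role attached to the Jenkins server are hypothetical placeholders.

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical prod account ID for illustration
prod_account_id = "222222222222"

assume_production_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": f"arn:aws:iam::{prod_account_id}:role/cross-account-jenkins-role",
        }
    ],
}

# Attach the inline policy to the IAM role associated with the Jenkins server
iam.put_role_policy(
    RoleName="<jenkins-server-role-name>",  # hypothetical placeholder
    PolicyName="assume-production-role-policy",
    PolicyDocument=json.dumps(assume_production_role_policy),
)
```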
Set up the model package group in the central model registry account
From the SageMaker Studio domain in the central model registry account, create a model package group called mammo-severity-model-package using the following code snippet (which you can run in a Jupyter notebook):
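The original snippet is not reproduced here; a minimal sketch using Boto3 might look like the following (the description text is illustrative).

```python
import boto3

sm_client = boto3.client("sagemaker")

# Create the model package group in the central model registry account
sm_client.create_model_package_group(
    ModelPackageGroupName="mammo-severity-model-package",
    ModelPackageGroupDescription="Model package group for the mammographic mass severity model",
)
```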
Set up access to the model package for IAM roles in the dev and prod accounts
Provision access to the SageMaker execution roles created in the dev and prod accounts so you can register model versions within the model package group mammo-severity-model-package in the central model registry from both accounts. From the SageMaker Studio domain in the central model registry account, run the following code in a Jupyter notebook:
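The original code is not reproduced here; the following is a minimal sketch that attaches a resource policy to the model package group with put_model_package_group_policy. The account IDs, Region, and the exact set of allowed actions are assumptions to illustrate the pattern; scope the principals and actions to your own roles and requirements.

```python
import json
import boto3

sm_client = boto3.client("sagemaker")

# Hypothetical account IDs and Region for illustration
central_account_id = "000000000000"
dev_account_id = "111111111111"
prod_account_id = "222222222222"
region = "us-east-1"

model_package_group_arn = (
    f"arn:aws:sagemaker:{region}:{central_account_id}:model-package-group/mammo-severity-model-package"
)
model_package_arn = (
    f"arn:aws:sagemaker:{region}:{central_account_id}:model-package/mammo-severity-model-package/*"
)

cross_account_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCrossAccountModelRegistryAccess",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    f"arn:aws:iam::{dev_account_id}:root",
                    f"arn:aws:iam::{prod_account_id}:root",
                ]
            },
            "Action": [
                "sagemaker:DescribeModelPackageGroup",
                "sagemaker:DescribeModelPackage",
                "sagemaker:ListModelPackages",
                "sagemaker:CreateModelPackage",
                "sagemaker:UpdateModelPackage",
            ],
            "Resource": [model_package_group_arn, model_package_arn],
        }
    ],
}

# Attach the resource policy to the model package group
sm_client.put_model_package_group_policy(
    ModelPackageGroupName="mammo-severity-model-package",
    ResourcePolicy=json.dumps(cross_account_policy),
)
```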
Set up Jenkins
In this section, we configure Jenkins to create the ML pipelines and the corresponding Terraform infrastructure in the prod account through the Jenkins CI/CD pipeline.
- On the CloudWatch console, create a log group named jenkins-log within the prod account, to which Jenkins will push logs from the CI/CD pipeline. The log group should be created in the same Region where the Jenkins server is set up.
- Install the required plugins on your Jenkins server.
- Set up AWS credentials in Jenkins using the cross-account IAM role (cross-account-jenkins-role) provisioned in the prod account.
- For System Configuration, choose AWS.
- Provide the credentials and CloudWatch log group you created earlier.
- Set up GitHub credentials within Jenkins.
- Create a new project in Jenkins.
- Enter a project name and choose Pipeline.
- On the General tab, select GitHub project and enter the forked GitHub repository URL.
- Select This project is parameterized.
- On the Add Parameter menu, choose String Parameter.
- For Name, enter prodAccount.
- For Default Value, enter the prod account ID.
- Under Advanced Project Options, for Definition, select Pipeline script from SCM.
- For SCM, choose Git.
- For Repository URL, enter the forked GitHub repository URL.
- For Credentials, enter the GitHub credentials saved in Jenkins.
- Enter main in the Branches to build section; the CI/CD pipeline will be triggered based on this branch.
- For Script Path, enter Jenkinsfile.
- Choose Save.
The Jenkins pipeline should be created and visible on your dashboard.
Provision S3 buckets, collect and prepare data
Complete the following steps to set up your S3 buckets and data:
- Create an S3 bucket of your choice, with the string sagemaker as part of the bucket name, in both the dev and prod accounts to store datasets and model artifacts.
- Set up an S3 bucket to maintain the Terraform state in the prod account.
- Download and save the publicly available UCI Mammography Mass dataset to the S3 bucket you created earlier in the dev account.
- Fork and clone the GitHub repository within the SageMaker Studio domain in the dev account. The repo has the following folder structure:
- /environments – Configuration script for prod environment
- /mlops-infra – Code for deploying AWS services using Terraform code
- /pipelines – Code for SageMaker pipeline components
- Jenkinsfile – Script to deploy through Jenkins CI/CD pipeline
- setup.py – Needed to install the required Python modules and create the run-pipeline command
- mammography-severity-modeling.ipynb – Allows you to create and run the ML workflow
- Create a folder called data within the cloned GitHub repository folder and save a copy of the publicly available UCI Mammography Mass dataset.
- Follow the Jupyter notebook mammography-severity-modeling.ipynb.
- Run the following code in the notebook to preprocess the dataset and upload it to the S3 bucket in the dev account (a minimal sketch is shown after this list).
The code will generate the following datasets:
- data/mammo-train-dataset-part1.csv – Used to train the first version of the model.
- data/mammo-train-dataset-part2.csv – Used to train the second version of the model along with the mammo-train-dataset-part1.csv dataset.
- data/mammo-batch-dataset.csv – Used to generate inferences.
- data/mammo-batch-dataset-outliers.csv – Introduces outliers into the dataset to fail the inference pipeline, which lets us test the pattern that triggers automated retraining of the model.
- Upload the dataset mammo-train-dataset-part1.csv under the prefix mammography-severity-model/train-dataset, and upload the datasets mammo-batch-dataset.csv and mammo-batch-dataset-outliers.csv to the prefix mammography-severity-model/batch-dataset of the S3 bucket created in the dev account.
- Upload the datasets mammo-train-dataset-part1.csv and mammo-train-dataset-part2.csv under the prefix mammography-severity-model/train-dataset into the S3 bucket created in the prod account through the Amazon S3 console.
- Upload the datasets mammo-batch-dataset.csv and mammo-batch-dataset-outliers.csv to the prefix mammography-severity-model/batch-dataset of the S3 bucket in the prod account.
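As referenced earlier in the list, the following is a minimal sketch of the preprocessing step. The column names, the way the splits and outliers are produced, and the placeholder bucket name are assumptions for illustration; the actual notebook code in the repository may differ.

```python
import boto3
import pandas as pd

# Assumed attribute names for the UCI Mammographic Mass dataset
columns = ["BI-RADS", "Age", "Shape", "Margin", "Density", "Severity"]

df = pd.read_csv("data/mammographic_masses.data", names=columns, na_values="?")
df = df.dropna().astype(int)

# Split into the two training datasets used throughout this post
train_part1 = df.iloc[: len(df) // 2]
train_part2 = df.iloc[len(df) // 2 :]
train_part1.to_csv("data/mammo-train-dataset-part1.csv", index=False)
train_part2.to_csv("data/mammo-train-dataset-part2.csv", index=False)

# Batch dataset without the target column for inference
batch = df.drop(columns=["Severity"])
batch.to_csv("data/mammo-batch-dataset.csv", index=False)

# Inject implausible values to simulate outliers that should fail the inference pipeline
batch_outliers = batch.copy()
batch_outliers.loc[batch_outliers.sample(frac=0.05, random_state=42).index, "Age"] = 999
batch_outliers.to_csv("data/mammo-batch-dataset-outliers.csv", index=False)

# Upload one of the files to the dev account bucket as an example
s3 = boto3.client("s3")
bucket = "<s3-bucket-in-dev-account>"  # placeholder
s3.upload_file(
    "data/mammo-train-dataset-part1.csv",
    bucket,
    "mammography-severity-model/train-dataset/mammo-train-dataset-part1.csv",
)
```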
Run the train pipeline
Under <project-name>/pipelines/train, you can see the following Python scripts:
- scripts/raw_preprocess.py – Integrates with SageMaker Processing for feature engineering
- scripts/evaluate_model.py – Allows model metrics calculation, in this case auc_score
- train_pipeline.py – Contains the code for the model training pipeline
Complete the following steps (a combined sketch of these steps is shown after the list):
- Upload the scripts into Amazon S3.
- Get the train pipeline instance.
- Submit the train pipeline and run it.
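The notebook code for these steps is not reproduced here; a minimal sketch follows. The get_pipeline helper, its arguments, and the S3 prefixes are assumptions based on the repository layout described earlier, so the actual names in the repository may differ.

```python
import boto3
import sagemaker

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = "<s3-bucket-in-dev-account>"  # placeholder

# Step 1 – upload the processing and evaluation scripts to Amazon S3
s3 = boto3.client("s3")
for script in ["raw_preprocess.py", "evaluate_model.py"]:
    s3.upload_file(
        f"pipelines/train/scripts/{script}",
        bucket,
        f"mammography-severity-model/scripts/{script}",
    )

# Step 2 – get the train pipeline instance (assumed entry point in train_pipeline.py)
from pipelines.train.train_pipeline import get_pipeline  # hypothetical helper

train_pipeline = get_pipeline(
    region=sess.boto_region_name,
    role=role,
    default_bucket=bucket,
)

# Step 3 – submit the pipeline definition and start a run
train_pipeline.upsert(role_arn=role)
execution = train_pipeline.start()
execution.wait()
```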
The following figure shows a successful run of the training pipeline. The final step in the pipeline registers the model in the central model registry account.
Approve the model in the central model registry
Log in to the central model registry account and access the SageMaker model registry within the SageMaker Studio domain. Change the model version status to Approved.
Once approved, the model version’s status changes to Approved.
Run the inference pipeline (Optional)
This step is not required, but you can still run the inference pipeline to generate predictions in the dev account.
Under <project-name>/pipelines/inference, you can see the following Python scripts:
- scripts/lambda_helper.py – Pulls the latest approved model version from the central model registry account using a SageMaker Pipelines Lambda step
- inference_pipeline.py – Contains the code for the model inference pipeline
Complete the following steps (a combined sketch of these steps is shown after the list):
- Upload the script to the S3 bucket.
- Get the inference pipeline instance using the normal batch dataset.
- Submit the inference pipeline and run it.
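As with the train pipeline, the notebook code is not shown above; the following is a minimal sketch. The get_pipeline helper, its arguments, and the S3 prefixes are assumptions and may differ from the actual repository code.

```python
import boto3
import sagemaker

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = "<s3-bucket-in-dev-account>"  # placeholder

# Step 1 – upload the Lambda helper script to Amazon S3
s3 = boto3.client("s3")
s3.upload_file(
    "pipelines/inference/scripts/lambda_helper.py",
    bucket,
    "mammography-severity-model/scripts/lambda_helper.py",
)

# Step 2 – get the inference pipeline instance (assumed entry point in inference_pipeline.py)
from pipelines.inference.inference_pipeline import get_pipeline  # hypothetical helper

inference_pipeline = get_pipeline(
    region=sess.boto_region_name,
    role=role,
    default_bucket=bucket,
    transform_input=f"s3://{bucket}/mammography-severity-model/batch-dataset/mammo-batch-dataset.csv",
)

# Step 3 – submit the pipeline definition and start a run
inference_pipeline.upsert(role_arn=role)
execution = inference_pipeline.start()
execution.wait()
```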
The following figure shows a successful run of the inference pipeline. The final step in the pipeline generates the predictions and stores them in the S3 bucket. We use MonitorBatchTransformStep to monitor the inputs into the batch transform job. If there are any outliers, the inference pipeline goes into a failed state.
Run the Jenkins pipeline
The environment/ folder within the GitHub repository contains the configuration script for the prod account. Complete the following steps to trigger the Jenkins pipeline:
- Update the config script prod.tfvars.json based on the resources created in the previous steps.
- Once updated, push the code into the forked GitHub repository and merge the code into the main branch.
- Go to the Jenkins UI, choose Build with Parameters, and trigger the CI/CD pipeline created in the previous steps.
When the build is complete and successful, you can log in to the prod account and see the train and inference pipelines within the SageMaker Studio domain.
Additionally, you will see three EventBridge rules on the EventBridge console in the prod account:
- Schedule the inference pipeline
- Send a failure notification on the train pipeline
- Trigger the train pipeline and send a notification when the inference pipeline fails
Finally, you will see an SNS topic on the Amazon SNS console that sends notifications through email. You’ll receive an email asking you to confirm your subscription to these notifications.
Test the inference pipeline using a batch dataset without outliers
To test if the inference pipeline is working as expected in the prod account, we can log in to the prod account and trigger the inference pipeline using the batch dataset without outliers.
Run the pipeline via the SageMaker Pipelines console in the SageMaker Studio domain of the prod account, where the transform_input will be the S3 URI of the dataset without outliers (s3://<s3-bucket-in-prod-account>/mammography-severity-model/data/mammo-batch-dataset.csv).
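If you prefer to trigger the run programmatically instead of through the console, a sketch using the SageMaker API might look like the following. The pipeline name is a placeholder, and the parameter name transform_input is taken from this post.

```python
import boto3

sm_client = boto3.client("sagemaker")

# Start an inference pipeline execution, overriding the transform_input parameter
response = sm_client.start_pipeline_execution(
    PipelineName="<inference-pipeline-name-in-prod>",  # placeholder
    PipelineParameters=[
        {
            "Name": "transform_input",
            "Value": "s3://<s3-bucket-in-prod-account>/mammography-severity-model/data/mammo-batch-dataset.csv",
        }
    ],
)
print(response["PipelineExecutionArn"])
```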
The inference pipeline succeeds and writes the predictions back to the S3 bucket.
Test the inference pipeline using a batch dataset with outliers
You can run the inference pipeline using the batch dataset with outliers to check if the automated retraining mechanism works as expected.
Run the pipeline via the SageMaker Pipelines console in the SageMaker Studio domain of the prod account, where the transform_input will be the S3 URI of the dataset with outliers (s3://<s3-bucket-in-prod-account>/mammography-severity-model/data/mammo-batch-dataset-outliers.csv).
The inference pipeline fails as expected, which triggers the EventBridge rule, which in turn triggers the train pipeline.
After a few moments, you should see a new run of the train pipeline on the SageMaker Pipelines console, which picks up the two different train datasets (mammo-train-dataset-part1.csv and mammo-train-dataset-part2.csv) uploaded to the S3 bucket to retrain the model.
You will also see a notification sent to the email subscribed to the SNS topic.
To use the updated model version, log in to the central model registry account and approve the model version, which will be picked up during the next run of the inference pipeline triggered through the scheduled EventBridge rule.
Although the train and inference pipelines use a static dataset URL, you can have the dataset URL passed to the train and inference pipelines as dynamic variables in order to use updated datasets to retrain the model and generate predictions in a real-world scenario.
Clean up
To avoid incurring future charges, complete the following steps:
- Remove the SageMaker Studio domain across all the AWS accounts.
- Delete all the resources created outside SageMaker, including the S3 buckets, IAM roles, EventBridge rules, and SNS topic set up through Terraform in the prod account.
- Delete the SageMaker pipelines created across accounts using the AWS Command Line Interface (AWS CLI).
Conclusion
Organizations often need to align with enterprise-wide toolsets to enable collaboration across different functional areas and teams. This collaboration ensures that your MLOps platform can adapt to evolving business needs and accelerates the adoption of ML across teams. This post explained how to create an MLOps framework in a multi-environment setup to enable automated model retraining, batch inference, and monitoring with Amazon SageMaker Model Monitor, model versioning with SageMaker Model Registry, and promotion of ML code and pipelines across environments with a CI/CD pipeline. We showcased this solution using a combination of AWS services and third-party toolsets. For instructions on implementing this solution, see the GitHub repository. You can also extend this solution by bringing in your own data sources and modeling frameworks.
About the Authors
Gayatri Ghanakota is a Sr. Machine Learning Engineer with AWS Professional Services. She is passionate about developing, deploying, and explaining AI/ML solutions across various domains. Prior to this role, she led multiple initiatives as a data scientist and ML engineer with top global firms in the financial and retail space. She holds a master’s degree in Computer Science with a specialization in Data Science from the University of Colorado, Boulder.
Sunita Koppar is a Sr. Data Lake Architect with AWS Professional Services. She is passionate about solving customer pain points in processing big data and providing long-term scalable solutions. Prior to this role, she developed products in the internet, telecom, and automotive domains, and has been an AWS customer. She holds a master’s degree in Data Science from the University of California, Riverside.
Saswata Dash is a DevOps Consultant with AWS Professional Services. She has worked with customers across healthcare and life sciences, aviation, and manufacturing. She is passionate about all things automation and has comprehensive experience in designing and building enterprise-scale customer solutions in AWS. Outside of work, she pursues her passion for photography and catching sunrises.
Customizing coding companions for organizations
Generative AI models for coding companions are mostly trained on publicly available source code and natural language text. While the large size of the training corpus enables the models to generate code for commonly used functionality, these models are unaware of code in private repositories and the associated coding styles that are enforced when developing with them. Consequently, the generated suggestions may require rewriting before they are appropriate for incorporation into an internal repository.
We can address this gap and minimize additional manual editing by embedding code knowledge from private repositories on top of a language model trained on public code. This is why we developed a customization capability for Amazon CodeWhisperer. In this post, we show you two possible ways of customizing coding companions using retrieval augmented generation and fine-tuning.
Our goal with CodeWhisperer customization capability is to enable organizations to tailor the CodeWhisperer model using their private repositories and libraries to generate organization-specific code recommendations that save time, follow organizational style and conventions, and avoid bugs or security vulnerabilities. This benefits enterprise software development and helps overcome the following challenges:
- Sparse documentation or information for internal libraries and APIs that forces developers to spend time examining previously written code to replicate usage.
- Lack of awareness of, and consistency in implementing, enterprise-specific coding practices, styles, and patterns.
- Inadvertent use of deprecated code and APIs by developers.
By using internal code repositories that have already undergone code reviews for additional training, the language model can surface the use of internal APIs and code blocks that overcome the preceding list of problems. Because the reference code has already been reviewed and meets the customer’s high bar, the likelihood of introducing bugs or security vulnerabilities is also minimized. And by carefully selecting the source files used for customization, organizations can reduce the use of deprecated code.
Design challenges
Customizing code suggestions based on an organization’s private repositories has many interesting design challenges. Deploying large language models (LLMs) to surface code suggestions has fixed costs for availability and variable costs due to inference based on the number of tokens generated. Therefore, having separate customizations for each customer and hosting them individually, thereby incurring additional fixed costs, can be prohibitively expensive. On the other hand, having multiple customizations simultaneously on the same system necessitates multi-tenant infrastructure to isolate proprietary code for each customer. Furthermore, the customization capability should surface knobs to enable the selection of the appropriate training subset from the internal repository using different metrics (for example, files with a history of fewer bugs or code that is recently committed into the repository). By selecting the code based on these metrics, the customization can be trained using higher-quality code which can improve the quality of code suggestions. Finally, even with continuously evolving code repositories, the cost associated with customization should be minimal to help enterprises realize cost savings from increased developer productivity.
A baseline approach to building customization could be to pretrain the model on a single training corpus composed of the existing (public) pretraining dataset along with the (private) enterprise code. While this approach works in practice, it requires (redundant) individual pretraining using the public dataset for each enterprise. It also incurs redundant deployment costs associated with hosting a customized model for each customer that only serves client requests originating from that customer. By decoupling the training of public and private code and deploying the customization on a multi-tenant system, these redundant costs can be avoided.
How to customize
At a high level, there are two types of possible customization techniques: retrieval-augmented generation (RAG) and fine-tuning (FT).
- Retrieval-augmented generation: RAG finds pieces of code within a repository that are similar to a given code fragment (for example, code that immediately precedes the cursor in the IDE) and augments the prompt used to query the LLM with these matched code snippets. This enriches the prompt to help nudge the model into generating more relevant code. There are a few techniques explored in the literature along these lines. See Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, REALM, kNN-LM and RETRO.
- Fine-tuning: FT takes a pre-trained LLM and trains it further on a specific, smaller codebase (compared to the pretraining dataset) to adapt it for the appropriate repository. Fine-tuning adjusts the LLM’s weights based on this training, making it more tailored to the organization’s unique needs.
Both RAG and fine-tuning are powerful tools for enhancing the performance of LLM-based customization. RAG can quickly adapt to private libraries or APIs with lower training complexity and cost. However, searching for and augmenting retrieved code snippets in the prompt increases latency at runtime. In contrast, fine-tuning does not require any augmentation of the context because the model is already trained on private libraries and APIs. However, it leads to higher training costs and complexities in serving the model when multiple custom models have to be supported across multiple enterprise customers. As we discuss later, these concerns can be remedied by optimizing the approach further.
Retrieval augmented generation
There are a few steps involved in RAG:
Indexing
Given a private repository as input by the admin, an index is created by splitting the source code files into chunks. Put simply, chunking turns the code snippets into digestible pieces that are likely to be most informative for the model and are easy to retrieve given the context. The size of a chunk and how it is extracted from a file are design choices that affect the final result. For example, chunks can be split based on lines of code or based on syntactic blocks, and so on.
Administrator Workflow
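As a concrete illustration of line-based chunking (not CodeWhisperer’s actual implementation), a simple chunker with overlapping windows might look like this:

```python
def chunk_source_file(source: str, max_lines: int = 20, overlap: int = 5):
    """Split a source file into overlapping line-based chunks for indexing."""
    lines = source.splitlines()
    chunks = []
    step = max_lines - overlap
    for start in range(0, len(lines), step):
        chunk = "\n".join(lines[start : start + max_lines])
        if chunk.strip():  # skip empty chunks
            chunks.append(chunk)
    return chunks
```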
Contextual search
Search a set of indexed code snippets based on a few lines of code above the cursor and retrieve relevant code snippets. This retrieval can happen using different algorithms. These choices might include:
- Bag of words (BM25) – A bag-of-words retrieval function that ranks a set of code snippets based on the query term frequencies and code snippet lengths.
BM25-based retrieval
The following figure illustrates how BM25 works. In order to use BM25, an inverted index is built first. This is a data structure that maps different terms to the code snippets that those terms occur in. At search time, we look up code snippets based on the terms present in the query and score them based on the frequency.
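The following is a minimal sketch of an inverted index with simple term-frequency scoring; a production BM25 implementation would also account for document length and inverse document frequency.

```python
import re
from collections import Counter, defaultdict

def tokenize(code: str):
    # Split identifiers and keywords into lowercase terms
    return re.findall(r"[A-Za-z_]+", code.lower())

def build_inverted_index(snippets):
    # Map each term to the snippets it occurs in, with term frequencies
    index = defaultdict(Counter)
    for snippet_id, snippet in enumerate(snippets):
        for term in tokenize(snippet):
            index[term][snippet_id] += 1
    return index

def search(index, query, top_k=3):
    # Score snippets by summed term frequencies of the query terms
    scores = Counter()
    for term in tokenize(query):
        for snippet_id, freq in index.get(term, {}).items():
            scores[snippet_id] += freq
    return [snippet_id for snippet_id, _ in scores.most_common(top_k)]
```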
- Semantic retrieval [Contriever, UniXcoder] – Converts the query and indexed code snippets into high-dimensional vectors and ranks code snippets based on semantic similarity. Formally, k-nearest neighbors (KNN) or approximate nearest neighbor (ANN) search is often used to find other snippets with similar semantics.
Semantic retrieval
BM25 focuses on lexical matching. Therefore, replacing “add” with “delete” may not change the BM25 score based on the terms in the query, but the retrieved functionality may be the opposite of what is required. In contrast, semantic retrieval focuses on the functionality of the code snippet even though variable and API names may be different. Typically, a combination of BM25 and semantic retrievals can work well together to deliver better results.
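As an illustration (not the production retrieval system), ranking by cosine similarity over precomputed embedding vectors might look like the following; the embedding model itself is out of scope here.

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, snippet_vecs: np.ndarray, top_k: int = 3):
    """Return indices of the top_k snippets most similar to the query embedding."""
    query = query_vec / np.linalg.norm(query_vec)
    snippets = snippet_vecs / np.linalg.norm(snippet_vecs, axis=1, keepdims=True)
    similarities = snippets @ query  # cosine similarity against every snippet
    return np.argsort(-similarities)[:top_k]
```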
Augmented inference
When developers write code, their existing program is used to formulate a query that is sent to the retrieval index. After retrieving multiple code snippets using one of the techniques discussed above, we prepend them to the original prompt. There are many design choices here, including the number of snippets to be retrieved, the relative placement of the snippets in the prompt, and the size of the snippet. The final design choice is primarily driven by empirical observation by exploring various approaches with the underlying language model and plays a key role in determining the accuracy of the approach. The contents from the returned chunks and the original code are combined and sent to the model to get customized code suggestions.
Developer workflow
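To make the flow concrete, here is a hedged sketch of how retrieved snippets could be prepended to the developer’s context before querying a code generation model; the prompt format is illustrative, not the one CodeWhisperer uses.

```python
def build_augmented_prompt(retrieved_snippets, developer_context, max_snippets=3):
    """Prepend retrieved code snippets to the developer's current context."""
    sections = []
    for snippet in retrieved_snippets[:max_snippets]:
        sections.append(f"# Retrieved reference snippet\n{snippet}\n")
    sections.append(developer_context)  # the code immediately preceding the cursor
    return "\n".join(sections)
```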
Fine-tuning
Fine-tuning a language model is a form of transfer learning in which the weights of a pre-trained model are trained on new data. The goal is to retain the appropriate knowledge from a model already trained on a large corpus and refine, replace, or add new knowledge from the new corpus — in our case, a new codebase. Simply training on a new codebase leads to catastrophic forgetting. For example, the language model may “forget” its knowledge of safety or of the APIs that are sparsely used in the enterprise codebase to date. A variety of techniques, like experience replay, GEM, and PP-TF, are employed to address this challenge.
Fine tuning
There are two ways of fine-tuning. One approach is to use the additional data without augmenting the prompt to fine-tune the model. Another approach is to augment the prompt during fine-tuning by retrieving relevant code suggestions. This helps improve the model’s ability to provide better suggestions in the presence of retrieved code snippets. The model is then evaluated on a held-out set of examples after it is trained. Subsequently, the customized model is deployed and used for generating the code suggestions.
Despite the advantages of using dedicated LLMs for generating code on private repositories, the costs can be prohibitive for small and medium-sized organizations. This is because dedicated compute resources are necessary even though they may be underutilized given the size of the teams. One way to achieve cost efficiency is serving multiple models on the same compute (for example, SageMaker multi-tenancy). However, language models require one or more dedicated GPUs across multiple zones to handle latency and throughput constraints. Hence, multi-tenancy of full model hosting on each GPU is infeasible.
We can overcome this problem by serving multiple customers on the same compute using small adapters to the LLM. Parameter-efficient fine-tuning (PEFT) techniques like prompt tuning, prefix tuning, and Low-Rank Adaptation (LoRA) are used to lower training costs without any loss of accuracy. LoRA, especially, has seen great success at achieving similar (or better) accuracy than full-model fine-tuning. The basic idea is to learn a low-rank update matrix that is added to the original weight matrices of the targeted layers of the model. Typically, these adapters are then merged with the original model weights for serving, which results in the same size and architecture as the original neural network. By keeping the adapters separate, we can serve the same base model with many model adapters. This brings the economies of scale back to our small and medium-sized customers.
Low-Rank Adaptation (LoRA)
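To make the low-rank idea concrete, the following is a minimal PyTorch sketch of a LoRA-style linear layer (illustrative only, not CodeWhisperer’s implementation): the frozen base weight is combined with a trainable low-rank update, which can later be merged into the base weight or kept as a separate per-customer adapter.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weights
        # Low-rank update: delta_W = B @ A, with far fewer trainable parameters
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Base output plus the scaled low-rank correction
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling
```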
Measuring effectiveness of customization
We need evaluation metrics to assess the efficacy of the customized solution. Offline evaluation metrics act as guardrails against shipping customizations that are subpar compared to the default model. By building datasets out of a held-out dataset from within the provided repository, the customization approach can be applied to this dataset to measure effectiveness. Comparing the existing source code with the customized code suggestion quantifies the usefulness of the customization. Common measures used for this quantification include metrics like edit similarity, exact match, and CodeBLEU.
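For example, a normalized edit-similarity score between a suggestion and the held-out reference code can be computed with Python’s standard library; exact match and CodeBLEU follow similar offline evaluation patterns.

```python
from difflib import SequenceMatcher

def edit_similarity(suggestion: str, reference: str) -> float:
    """Return a similarity ratio in [0, 1] between suggested and reference code."""
    return SequenceMatcher(None, suggestion, reference).ratio()

def exact_match(suggestion: str, reference: str) -> bool:
    """Return True if the suggestion matches the reference exactly (ignoring surrounding whitespace)."""
    return suggestion.strip() == reference.strip()
```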
It is also possible to measure usefulness by quantifying how often internal APIs are invoked by the customization and comparing it with the invocations in the pre-existing source. Of course, getting both aspects right is important for a successful completion. For our customization approach, we have designed a tailor-made metric known as Customization Quality Index (CQI), a single user-friendly measure ranging between 1 and 10. The CQI metric shows the usefulness of the suggestions from the customized model compared to code suggestions with a generic public model.
Summary
We built the Amazon CodeWhisperer customization capability based on a mixture of the leading techniques discussed in this blog post and evaluated it with user studies on developer productivity, conducted by Persistent Systems. In these two studies, commissioned by AWS, developers were asked to create a medical software application in Java that required use of their internal libraries. In the first study, developers without access to CodeWhisperer took (on average) ~8.2 hours to complete the task, while those who used CodeWhisperer (without customization) completed the task 62 percent faster, in (on average) ~3.1 hours.
In the second study with a different set of developer cohorts, developers using CodeWhisperer that had been customized using their private codebase completed the task in 2.5 hours on average, 28 percent faster than those who were using CodeWhisperer without customization and completed the task in ~3.5 hours on average. We strongly believe tools like CodeWhisperer that are customized to your codebase have a key role to play in further boosting developer productivity and recommend giving it a run. For more information and to get started, visit the Amazon CodeWhisperer page.
About the authors
Qing Sun is a Senior Applied Scientist in AWS AI Labs and works on AWS CodeWhisperer, a generative AI-powered coding assistant. Her research interests lie in Natural Language Processing, AI4Code, and generative AI. In the past, she worked on several NLP-based services, such as Comprehend Medical, a medical diagnosis system at Amazon Health AI, and a machine translation system at Meta AI. She received her PhD from Virginia Tech in 2017.
Arash Farahani is an Applied Scientist with Amazon CodeWhisperer. His current interests are in generative AI, search, and personalization. Arash is passionate about building solutions that resolve developer pain points. He has worked on multiple features within CodeWhisperer, and introduced NLP solutions into various internal workstreams that touch all Amazon developers. He received his PhD from University of Illinois at Urbana-Champaign in 2017.
Xiaofei Ma is an Applied Science Manager in AWS AI Labs. He joined Amazon in 2016 as an Applied Scientist within the SCOT organization and later moved to AWS AI Labs in 2018, working on Amazon Kendra. Xiaofei has served as the science manager for several services, including Kendra, Contact Lens, and most recently CodeWhisperer and CodeGuru Security. His research interests lie in the area of AI4Code and Natural Language Processing. He received his PhD from the University of Maryland, College Park in 2010.
Murali Krishna Ramanathan is a Principal Applied Scientist in AWS AI Labs and co-leads AWS CodeWhisperer, a generative AI-powered coding companion. He is passionate about building software tools and workflows that help improve developer productivity. In the past, he built Piranha, an automated refactoring tool to delete code due to stale feature flags and led code quality initiatives at Uber engineering. He is a recipient of the Google faculty award (2015), ACM SIGSOFT Distinguished paper award (ISSTA 2016) and Maurice Halstead award (Purdue 2006). He received his PhD in Computer Science from Purdue University in 2008.
Ramesh Nallapati is a Senior Principal Applied Scientist in AWS AI Labs and co-leads CodeWhisperer, a generative AI-powered coding companion, and Titan Large Language Models at AWS. His interests are mainly in the areas of Natural Language Processing and Generative AI. In the past, Ramesh has provided science leadership in delivering many NLP-based AWS products such as Kendra, Quicksight Q and Contact Lens. He held research positions at Stanford, CMU and IBM Research, and received his Ph.D. in Computer Science from University of Massachusetts Amherst in 2006.
Enter a World of Samurai and Demons: GFN Thursday Brings Capcom’s ‘Onimusha: Warlords’ to the Cloud
Wield the blade and embrace the way of the samurai for some thrilling action — Onimusha: Warlords comes to GeForce NOW this week. Members can experience feudal Japan in this hack-and-slash adventure game in the cloud.
It’s part of an action-packed GFN Thursday, with 16 more games joining the cloud gaming platform’s library.
Forging Destinies
Capcom’s popular Onimusha: Warlords is newly supported in the cloud this week, just in time for those tuning into the recently released Netflix anime adaptation.
Fight against the evil warlord Nobunaga Oda and his army of demons as samurai Samanosuke Akechi. Explore feudal Japan, wield swords, use ninja techniques and solve puzzles to defeat enemies. The action-adventure hack-and-slash game has been enhanced with improved controls for smoother swordplay mechanics, an updated soundtrack and more.
Ultimate members can stream the game in ultrawide resolution, with gaming sessions of up to eight hours, for riveting samurai action.
Endless Games
Roguelite fans and GeForce NOW members have been enjoying Sega’s Endless Dungeon in the cloud. Recruit a team of shipwrecked heroes, plunge into a long-abandoned space station and protect the crystal against never-ending waves of monsters. Never accept defeat — get reloaded and try, try again.
On top of that, check out the 16 newly supported games joining the GeForce NOW library this week:
- The Invincible (New release on Steam, Nov. 6)
- Roboquest (New release on Steam, Nov. 7)
- Stronghold: Definitive Edition (New release on Steam, Nov. 7)
- Dungeons 4 (New release on Steam, Xbox and available on PC Game Pass, Nov. 9)
- Space Trash Scavenger (New release on Steam, Nov. 9)
- Airport CEO (Steam)
- Car Mechanic Simulator 2021 (Xbox, available on PC Game Pass)
- Farming Simulator 19 (Xbox, available on Microsoft Store)
- GoNNER (Xbox, available on Microsoft Store)
- GoNNER2 (Xbox, available on Microsoft Store)
- Jurassic World Evolution 2 (Xbox, available on PC Game Pass)
- Onimusha: Warlords (Steam)
- Planet of Lana (Xbox, available on PC Game Pass)
- Q.U.B.E. 10th Anniversary (Epic Games Store)
- Trailmakers (Xbox, available on PC Game Pass)
- Turnip Boy Commits Tax Evasion (Epic Games Store)
What are you planning to play this weekend? Let us know on Twitter or in the comments below.
OpenAI Data Partnerships
Working together to create open-source and private datasets for AI training.
Build a medical imaging AI inference pipeline with MONAI Deploy on AWS
This post is cowritten with Ming (Melvin) Qin, David Bericat, and Brad Genereaux from NVIDIA.
Medical imaging AI researchers and developers need a scalable, enterprise framework to build, deploy, and integrate their AI applications. AWS and NVIDIA have come together to make this vision a reality. AWS, NVIDIA, and other partners build applications and solutions to make healthcare more accessible, affordable, and efficient by accelerating cloud connectivity of enterprise imaging. MONAI Deploy is one of the key modules within MONAI (Medical Open Network for Artificial Intelligence) developed by a consortium of academic and industry leaders, including NVIDIA. AWS HealthImaging (AHI) is a HIPAA-eligible, highly scalable, performant, and cost-effective medical imaging store. We have developed a MONAI Deploy connector to AHI to integrate medical imaging AI applications with subsecond image retrieval latencies at scale, powered by cloud-native APIs. The MONAI AI models and applications can be hosted on Amazon SageMaker, which is a fully managed service to deploy machine learning (ML) models at scale. SageMaker takes care of setting up and managing instances for inference and provides built-in metrics and logs for endpoints that you can use to monitor and receive alerts. It also offers a variety of NVIDIA GPU instances for ML inference, as well as multiple model deployment options with automatic scaling, including real-time inference, serverless inference, asynchronous inference, and batch transform.
In this post, we demonstrate how to deploy a MONAI Application Package (MAP) with the connector to AWS HealthImaging, using a SageMaker multi-model endpoint for real-time inference and asynchronous inference. These two options cover a majority of near-real-time medical imaging inference pipeline use cases.
Solution overview
The following diagram illustrates the solution architecture.
Prerequisites
Complete the following prerequisite steps:
- Use an AWS account in one of the following Regions, where AWS HealthImaging is available: North Virginia (us-east-1), Oregon (us-west-2), Ireland (eu-west-1), or Sydney (ap-southeast-2).
- Create an Amazon SageMaker Studio domain and user profile with AWS Identity and Access Management (IAM) permission to access AWS HealthImaging.
- Enable the JupyterLab v3 extension and install Imjoy-jupyter-extension if you want to visualize medical images interactively in a SageMaker notebook using itkwidgets.
MAP connector to AWS HealthImaging
AWS HealthImaging imports DICOM P10 files and converts them into ImageSets, which are an optimized representation of a DICOM series. AHI provides API access to ImageSet metadata and ImageFrames. The metadata contains all DICOM attributes in a JSON document. ImageFrames are returned encoded in the High-Throughput JPEG2000 (HTJ2K) lossless format, which can be decoded extremely fast. ImageSets can be retrieved by using the AWS Command Line Interface (AWS CLI) or the AWS SDKs.
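For example, a minimal Boto3 sketch for retrieving ImageSet metadata and a single image frame might look like the following. The IDs are placeholders, and the assumption that the metadata blob is gzip-compressed should be verified against the response’s content encoding and the AWS HealthImaging API reference.

```python
import gzip
import json
import boto3

ahi = boto3.client("medical-imaging")

# Retrieve ImageSet metadata (returned as a blob, typically gzip-compressed JSON)
metadata_blob = ahi.get_image_set_metadata(
    datastoreId="<datastore-id>", imageSetId="<image-set-id>"
)["imageSetMetadataBlob"].read()
metadata = json.loads(gzip.decompress(metadata_blob))

# Retrieve a single HTJ2K-encoded image frame by its frame ID
frame_blob = ahi.get_image_frame(
    datastoreId="<datastore-id>",
    imageSetId="<image-set-id>",
    imageFrameInformation={"imageFrameId": "<image-frame-id>"},
)["imageFrameBlob"].read()
```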
MONAI is a medical imaging AI framework that takes research breakthroughs and AI applications into clinical impact. MONAI Deploy is the processing pipeline that enables the end-to-end workflow, including packaging, testing, deploying, and running medical imaging AI applications in clinical production. It comprises the MONAI Deploy App SDK, MONAI Deploy Express, Workflow Manager, and Informatics Gateway. The MONAI Deploy App SDK provides ready-to-use algorithms and a framework to accelerate building medical imaging AI applications, as well as utility tools to package the application into a MAP container. The built-in standards-based functionalities in the app SDK allow the MAP to smoothly integrate into health IT networks, which require the use of standards such as DICOM, HL7, and FHIR, and to operate across data center and cloud environments. MAPs can use both predefined and customized operators for DICOM image loading, series selection, model inference, and postprocessing.
We have developed a Python module using the AWS HealthImaging Python SDK Boto3. You can pip install it and use the helper function to retrieve DICOM Service-Object Pair (SOP) instances as follows:
!pip install -q AHItoDICOMInterface
from AHItoDICOMInterface.AHItoDICOM import AHItoDICOM
helper = AHItoDICOM()
instances = helper.DICOMizeImageSet(datastore_id=datastoreId, image_set_id=next(iter(imageSetIds)))
The output SOP instances can be visualized using the interactive 3D medical image viewer itkwidgets in the following notebook. The AHItoDICOM class takes advantage of multiple processes to retrieve pixel frames from AWS HealthImaging in parallel, and decode the HTJ2K binary blobs using the Python OpenJPEG library. The ImageSetIds come from the output files of a given AWS HealthImaging import job. Given the DatastoreId and import JobId, you can retrieve the ImageSetId, which is equivalent to the DICOM series instance UID, as follows:
imageSetIds = {}
try:
    response = s3.head_object(Bucket=OutputBucketName, Key=f"output/{res_createstore['datastoreId']}-DicomImport-{res_startimportjob['jobId']}/job-output-manifest.json")
    if response['ResponseMetadata']['HTTPStatusCode'] == 200:
        data = s3.get_object(Bucket=OutputBucketName, Key=f"output/{res_createstore['datastoreId']}-DicomImport-{res_startimportjob['jobId']}/SUCCESS/success.ndjson")
        contents = data['Body'].read().decode("utf-8")
        for l in contents.splitlines():
            isid = json.loads(l)['importResponse']['imageSetId']
            if isid in imageSetIds:
                imageSetIds[isid] += 1
            else:
                imageSetIds[isid] = 1
except ClientError:
    pass
With the ImageSetId, you can retrieve the DICOM header metadata and image pixels separately using native AWS HealthImaging API functions. The DICOM exporter aggregates the DICOM headers and image pixels into a Pydicom dataset, which can be processed by the MAP DICOM data loader operator. Using the DICOMizeImageSet() function, we have created a connector to load image data from AWS HealthImaging, based on the MAP DICOM data loader operator:
class AHIDataLoaderOperator(Operator):
    def __init__(self, ahi_client, must_load: bool = True, *args, **kwargs):
        self.ahi_client = ahi_client
        …
    def _load_data(self, input_obj: string):
        study_dict = {}
        series_dict = {}
        sop_instances = self.ahi_client.DICOMizeImageSet(input_obj['datastoreId'], input_obj['imageSetId'])
In the preceding code, ahi_client is an instance of the AHItoDICOM DICOM exporter class, with the data retrieval functions illustrated. We have included this new data loader operator in a 3D spleen segmentation AI application created with the MONAI Deploy App SDK. You can first explore how to create and run this application on a local notebook instance, and then deploy this MAP application into SageMaker managed inference endpoints.
SageMaker asynchronous inference
A SageMaker asynchronous inference endpoint is used for requests with large payload sizes (up to 1 GB), long processing times (up to 15 minutes), and near-real-time latency requirements. When there are no requests to process, this deployment option can downscale the instance count to zero for cost savings, which is ideal for medical imaging ML inference workloads. Follow the steps in the sample notebook to create and invoke the SageMaker asynchronous inference endpoint. To create an asynchronous inference endpoint, you will need to create a SageMaker model and endpoint configuration first. To create a SageMaker model, you will need to load a model.tar.gz package with a defined directory structure into a Docker container. The model.tar.gz package includes a pre-trained spleen segmentation model.ts file and a customized inference.py file. We have used a prebuilt container with Python 3.8 and PyTorch 1.12.1 framework versions to load the model and run predictions.
In the customized inference.py file, we instantiate an AHItoDICOM helper class from AHItoDICOMInterface and use it to create a MAP instance in the model_fn() function, and we run the MAP application on every inference request in the predict_fn() function:
from app import AISpleenSegApp
from AHItoDICOMInterface.AHItoDICOM import AHItoDICOM

helper = AHItoDICOM()

def model_fn(model_dir, context):
    …
    monai_app_instance = AISpleenSegApp(helper, do_run=False, path="/home/model-server")

def predict_fn(input_data, model):
    with open('/home/model-server/inputImageSets.json', 'w') as f:
        f.write(json.dumps(input_data))
    output_folder = "/home/model-server/output"
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)
    model.run(input='/home/model-server/inputImageSets.json', output=output_folder, workdir='/home/model-server', model='/opt/ml/model/model.ts')
To invoke the asynchronous endpoint, you will need to upload the request input payload to Amazon Simple Storage Service (Amazon S3), which is a JSON file specifying the AWS HealthImaging datastore ID and ImageSet ID to run inference on:
sess = sagemaker.Session()
InputLocation = sess.upload_data('inputImageSets.json', bucket=sess.default_bucket(), key_prefix=prefix, extra_args={"ContentType": "application/json"})
response = runtime_sm_client.invoke_endpoint_async(EndpointName=endpoint_name, InputLocation=InputLocation, ContentType="application/json", Accept="application/json")
output_location = response["OutputLocation"]
The output can be found in Amazon S3 as well.
SageMaker multi-model real-time inference
SageMaker real-time inference endpoints meet interactive, low-latency requirements. This option can host multiple models in one container behind one endpoint, which is a scalable and cost-effective solution to deploying several ML models. A SageMaker multi-model endpoint uses NVIDIA Triton Inference Server with GPU to run multiple deep learning model inferences.
In this section, we walk through how to create and invoke a multi-model endpoint using your own inference container in the following sample notebook. Different models can be served in a shared container on the same fleet of resources. Multi-model endpoints reduce deployment overhead and scale model inferences based on the traffic patterns to the endpoint. We used AWS developer tools, including AWS CodeCommit, AWS CodeBuild, and AWS CodePipeline, to build the customized container for SageMaker model inference. For the bring-your-own-container approach, we prepared a model_handler.py in place of the inference.py file from the previous example and implemented the initialize(), preprocess(), and inference() functions:
from app import AISpleenSegApp
from AHItoDICOMInterface.AHItoDICOM import AHItoDICOM

class ModelHandler(object):
    def __init__(self):
        self.initialized = False
        self.shapes = None

    def initialize(self, context):
        self.initialized = True
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        gpu_id = properties.get("gpu_id")
        helper = AHItoDICOM()
        self.monai_app_instance = AISpleenSegApp(helper, do_run=False, path="/home/model-server/")

    def preprocess(self, request):
        inputStr = request[0].get("body").decode('UTF8')
        datastoreId = json.loads(inputStr)['inputs'][0]['datastoreId']
        imageSetId = json.loads(inputStr)['inputs'][0]['imageSetId']
        with open('/tmp/inputImageSets.json', 'w') as f:
            f.write(json.dumps({"datastoreId": datastoreId, "imageSetId": imageSetId}))
        return '/tmp/inputImageSets.json'

    def inference(self, model_input):
        self.monai_app_instance.run(input=model_input, output="/home/model-server/output/", workdir="/home/model-server/", model=os.environ["model_dir"]+"/model.ts")
After the container is built and pushed to Amazon Elastic Container Registry (Amazon ECR), you can create a SageMaker model with it, plus different model packages (tar.gz files) in a given Amazon S3 path:
model_name = "DEMO-MONAIDeployModel" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
model_url = "s3://{}/{}/".format(bucket, prefix)
container = "{}.dkr.ecr.{}.amazonaws.com/{}:dev".format( account_id, region, prefix )
container = {"Image": container, "ModelDataUrl": model_url, "Mode": "MultiModel"}
create_model_response = sm_client.create_model(ModelName=model_name, ExecutionRoleArn=role, PrimaryContainer=container)
It’s noteworthy that the model_url here only specifies the path to a folder of tar.gz files; you specify which model package to use for inference when you invoke the endpoint, as shown in the following code:
Payload = {"inputs": [ {"datastoreId": datastoreId, "imageSetId": next(iter(imageSetIds))} ]}
response = runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name, ContentType="application/json", Accept="application/json", TargetModel="model.tar.gz", Body=json.dumps(Payload))
We can add more models to the existing multi-model inference endpoint without having to update the endpoint or create a new one.
Clean up
Don’t forget to complete the Delete the hosting resources step in the lab-3 and lab-4 notebooks to delete the SageMaker inference endpoints. You should also shut down the SageMaker notebook instance to save costs. Finally, you can either call the AWS HealthImaging API functions or use the AWS HealthImaging console to delete the image sets and data store created earlier:
for s in imageSetIds.keys():
    medicalimaging.deleteImageSet(datastoreId, s)
medicalimaging.deleteDatastore(datastoreId)
Conclusion
In this post, we showed you how to create a MAP connector to AWS HealthImaging, which is reusable in applications built with the MONAI Deploy App SDK, to integrate with and accelerate image data retrieval from a cloud-native DICOM store to medical imaging AI workloads. The MONAI Deploy SDK can be used to support hospital operations. We also demonstrated two hosting options to deploy MAP AI applications on SageMaker at scale.
Work through the example notebooks in the GitHub repository to learn more about deploying MONAI applications on SageMaker with medical images stored in AWS HealthImaging. To learn more about what AWS can do for you, contact an AWS representative.
For additional resources, refer to the following:
- Medical Imaging on AWS
- Introducing AWS HealthImaging — purpose-built for medical imaging at scale
- AWS HealthImaging Developer Guide
- Integration of on-premises medical imaging data with AWS HealthImaging
- AWS and NVIDIA
About the Authors
Ming (Melvin) Qin is an independent contributor on the Healthcare team at NVIDIA, focused on developing an AI inference application framework and platform to bring AI to medical imaging workflows. Before joining NVIDIA in 2018 as a founding member of Clara, Ming spent 15 years developing Radiology PACS and Workflow SaaS as lead engineer/architect at Stentor Inc., later acquired by Philips Healthcare to form its Enterprise Imaging.
David Bericat is a product manager for Healthcare at NVIDIA, where he leads the Project MONAI Deploy working group to bring AI from research to clinical deployments. His passion is to accelerate health innovation globally, translating it into true clinical impact. Previously, David worked at Red Hat, implementing open source principles at the intersection of AI, cloud, edge computing, and IoT. His proudest moments include hiking to the Everest base camp and playing soccer for over 20 years.
Brad Genereaux is Global Lead, Healthcare Alliances at NVIDIA, where he is responsible for developer relations with a focus in medical imaging to accelerate artificial intelligence and deep learning, visualization, virtualization, and analytics solutions. Brad evangelizes the ubiquitous adoption and integration of seamless healthcare and medical imaging workflows into everyday clinical practice, with more than 20 years of experience in healthcare IT.
Gang Fu is a Healthcare Solutions Architect at AWS. He holds a PhD in Pharmaceutical Science from the University of Mississippi and has over 10 years of technology and biomedical research experience. He is passionate about technology and the impact it can make on healthcare.
JP Leger is a Senior Solutions Architect supporting academic medical centers and medical imaging workflows at AWS. He has over 20 years of expertise in software engineering, healthcare IT, and medical imaging, with extensive experience architecting systems for performance, scalability, and security in distributed deployments of large data volumes on premises, in the cloud, and hybrid with analytics and AI.
Chris Hafey is a Principal Solutions Architect at Amazon Web Services. He has over 25 years’ experience in the medical imaging industry and specializes in building scalable high-performance systems. He is the creator of the popular CornerstoneJS open source project, which powers the popular OHIF open source zero footprint viewer. He contributed to the DICOMweb specification and continues to work towards improving its performance for web-based viewing.
Generative AI in Search expands to more than 120 new countries and territories
Generative AI in Search, or Search Generative Experience (SGE), is expanding around the world and adding four new languages.
Optimize for sustainability with Amazon CodeWhisperer
This post explores how Amazon CodeWhisperer can help with code optimization for sustainability through increased resource efficiency. Computationally resource-efficient coding is one technique that aims to reduce the amount of energy required to process a line of code and, as a result, aid companies in consuming less energy overall. In this era of cloud computing, developers are now harnessing open source libraries and advanced processing power available to them to build out large-scale microservices that need to be operationally efficient, performant, and resilient. However, modern applications often consist of extensive code, demanding significant computing resources. Although the direct environmental impact might not be obvious, sub-optimized code amplifies the carbon footprint of modern applications through factors like heightened energy consumption, prolonged hardware usage, and outdated algorithms. In this post, we discover how Amazon CodeWhisperer helps address these concerns and reduce the environmental footprint of your code.
Amazon CodeWhisperer is a generative AI coding companion that speeds up software development by making suggestions based on the existing code and natural language comments, reducing the overall development effort and freeing up time for brainstorming, solving complex problems, and authoring differentiated code. Amazon CodeWhisperer can help developers streamline their workflows, enhance code quality, build stronger security postures, generate robust test suites, and write computationally resource friendly code, which can help you optimize for environmental sustainability. It is available as part of the Toolkit for Visual Studio Code, AWS Cloud9, JupyterLab, Amazon SageMaker Studio, AWS Lambda, AWS Glue, and JetBrains IntelliJ IDEA. Amazon CodeWhisperer currently supports Python, Java, JavaScript, TypeScript, C#, Go, Rust, PHP, Ruby, Kotlin, C, C++, Shell scripting, SQL, and Scala.
Impact of unoptimized code on cloud computing and application carbon footprint
AWS’s infrastructure is 3.6 times more energy efficient than the median of surveyed US enterprise data centers and up to 5 times more energy efficient than the average European enterprise data center. Therefore, AWS can help lower a workload’s carbon footprint by up to 96%. You can now use Amazon CodeWhisperer to write quality code with reduced resource usage and energy consumption, and meet scalability objectives while benefiting from AWS energy-efficient infrastructure.
Increased resource usage
Unoptimized code can result in the ineffective usage of cloud computing resources. As a result, more virtual machines (VMs) or containers may be required, increasing resource allocation, energy use, and the related carbon footprint of the workload. You might encounter increases in the following (see the short sketch after this list for a simple illustration):
- CPU utilization – Unoptimized code often contains inefficient algorithms or coding practices that require excessive CPU cycles to run.
- Memory consumption – Inefficient memory management in unoptimized code can result in unnecessary memory allocation, deallocation, or data duplication.
- Disk I/O operations – Inefficient code can perform excessive input/output (I/O) operations. For example, if data is read from or written to disk more frequently than necessary, it can increase disk I/O utilization and latency.
- Network usage – Due to ineffective data transmission techniques or duplicate communication, poorly optimized code may cause excessive network traffic. This can lead to higher latency and increased network bandwidth utilization. In environments where network resources are billed based on usage, such as cloud computing, increased network utilization can result in higher costs and resource needs.
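As a simple, hypothetical illustration of the kind of change that reduces CPU and memory pressure, compare an implementation that rescans and copies data for every query with one that streams over it once; the function names and data shapes below are invented for this example:
# Inefficient sketch: builds intermediate lists and rescans the data for every customer
def total_spend_inefficient(orders, customer_ids):
    totals = {}
    for customer_id in customer_ids:
        matching = [o for o in orders if o["customer_id"] == customer_id]  # full scan per customer
        totals[customer_id] = sum(o["amount"] for o in matching)
    return totals

# Optimized sketch: a single pass over the data, no intermediate copies
def total_spend_optimized(orders, customer_ids):
    wanted = set(customer_ids)
    totals = {customer_id: 0 for customer_id in wanted}
    for order in orders:  # one scan, constant-time membership checks
        if order["customer_id"] in wanted:
            totals[order["customer_id"]] += order["amount"]
    return totals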
Higher energy consumption
The infrastructure supporting applications with inefficient code uses more processing power. Overusing computing resources due to inefficient, bloated code can result in higher energy consumption and heat production, which in turn requires more energy for cooling. Along with the servers, the cooling systems, the power distribution infrastructure, and other auxiliary elements also consume energy.
Scalability challenges
In application development, scalability issues can be caused by unoptimized code. Such code may not scale effectively as the workload grows, requiring more resources and using more energy. As mentioned previously, inefficient or wasteful code has a compounding effect at scale.
The energy savings from optimizing the code that customers run in data centers are compounded further when you consider that cloud providers such as AWS operate dozens of data centers around the world.
Amazon CodeWhisperer uses machine learning (ML) and large language models to provide code recommendations in real time based on the original code and natural language comments, and those recommendations can be more computationally efficient than the original implementation. A program’s infrastructure usage efficiency can be improved by optimizing the code through strategies such as algorithmic improvements, effective memory management, and a reduction in unnecessary I/O operations.
Code generation, completion, and suggestions
Let’s examine several situations where Amazon CodeWhisperer can be useful.
By automating the development of repetitive or complex code, code generation tools minimize the possibility of human error while focusing on platform-specific optimizations. By using established patterns or templates, these tools may produce code that more consistently adheres to sustainability best practices. Developers can produce code that complies with particular coding standards, helping deliver more consistent and dependable code throughout the project. The resulting code may be more efficient and, because it removes human coding variations, more legible, improving development speed. Code generation can also automatically apply ways to reduce the size of the application, such as deleting superfluous code, improving variable storage, or using compression methods. These optimizations can reduce memory consumption and boost overall system efficiency by shrinking the package size.
Generative AI has the potential to make programming more sustainable by optimizing resource allocation. Looking holistically at an application’s carbon footprint is important. Tools like Amazon CodeGuru Profiler can collect performance data to optimize latency between components. The profiling service examines code as it runs and identifies potential improvements. Developers can then manually refine the auto-generated code based on these findings to further improve energy efficiency. The combination of generative AI, profiling, and human oversight creates a feedback loop that can continuously improve code efficiency and reduce environmental impact.
The following screenshot shows results generated from CodeGuru Profiler in latency mode, which includes network and disk I/O. In this case, the application still spends most of its time in ImageProcessor.extractTasks (second row from the bottom), and almost all of that time is in the runnable state, which means it wasn’t waiting for anything. You can view these thread states by switching from CPU mode to latency mode. This can help you get a good idea of what is affecting the application’s wall clock time. For more information, refer to Reducing Your Organization’s Carbon Footprint with Amazon CodeGuru Profiler.
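If you want to collect this kind of profile from your own Python application, a minimal sketch using the codeguru_profiler_agent package (assuming AWS credentials are configured and a profiling group named MyProfilingGroup already exists) looks like the following:
from codeguru_profiler_agent import Profiler

# Start the in-process agent; it samples stack traces in the background and
# submits them to the MyProfilingGroup profiling group
Profiler(profiling_group_name="MyProfilingGroup").start()

# ... run the application as usual ...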
Generating test cases
Amazon CodeWhisperer can help suggest test cases and verify the code’s functionality by considering boundary values, edge cases, and other potential issues that may need to be tested. Also, Amazon CodeWhisperer can simplify creating repetitive code for unit testing. For example, if you need to create sample data using INSERT statements, Amazon CodeWhisperer can generate the necessary inserts based on a pattern. The overall resource requirements for software testing can also be decreased by identifying and optimizing resource-intensive test cases or removing redundant ones. Improved test suites have the potential to make the application more environmentally friendly by increasing energy efficiency, decreasing resource consumption, minimizing waste, and reducing the workload carbon footprint.
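As a hypothetical illustration of the boundary values and edge cases a coding companion might suggest for a simple discount function, consider the following sketch; the function and values are invented for this example:
import pytest

def apply_discount(price, percent):
    """Return the price after applying a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

# Boundary values and a typical case, the kind of coverage a test-generation suggestion might propose
@pytest.mark.parametrize("price,percent,expected", [
    (100.0, 0, 100.0),    # no discount
    (100.0, 100, 0.0),    # full discount
    (0.0, 50, 0.0),       # zero price
    (19.99, 15, 16.99),   # typical case with rounding
])
def test_apply_discount(price, percent, expected):
    assert apply_discount(price, percent) == expected

def test_apply_discount_rejects_invalid_percent():
    with pytest.raises(ValueError):
        apply_discount(100.0, 150)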
For a more hands-on experience with Amazon CodeWhisperer, refer to Optimize software development with Amazon CodeWhisperer. The post showcases the code recommendations from Amazon CodeWhisperer in Amazon SageMaker Studio. It also demonstrates the suggested code based on comments for loading and analyzing a dataset.
Conclusion
In this post, we learned how Amazon CodeWhisperer can help developers write optimized, more sustainable code. Using advanced ML models, Amazon CodeWhisperer analyzes your code and provides personalized recommendations for improving efficiency, which can reduce costs and help decrease the carbon footprint.
By suggesting minor adjustments and alternative approaches, Amazon CodeWhisperer enables developers to significantly cut resource usage and emissions without sacrificing functionality. Whether you’re looking to optimize an existing code base or ensure new projects are resource efficient, Amazon CodeWhisperer can be an invaluable aid. To learn more about Amazon CodeWhisperer and AWS Sustainability resources for code optimization, consider the following next steps:
- Get started with Amazon CodeWhisperer
- Learn how Amazon CodeWhisperer can accelerate and enhance software development
- Build a serverless app with Amazon CodeWhisperer
- Learn about software and architecture patterns in the Sustainability Pillar of the AWS Well-Architected Framework
- After you have updated your code, review runtime performance data from your live applications, along with the recommendations from Amazon CodeGuru Profiler that can help you fine-tune application performance on AWS
About the authors
Isha Dua is a Senior Solutions Architect based in the San Francisco Bay Area. She helps AWS enterprise customers grow by understanding their goals and challenges, and guides them on how they can architect their applications in a cloud-native manner while ensuring resilience and scalability. She’s passionate about machine learning technologies and environmental sustainability.
Ajjay Govindaram is a Senior Solutions Architect at AWS. He works with strategic customers who are using AI/ML to solve complex business problems. His experience lies in providing technical direction as well as design assistance for modest to large-scale AI/ML application deployments. His knowledge ranges from application architecture to big data, analytics, and machine learning. He enjoys listening to music while resting, experiencing the outdoors, and spending time with his loved ones.
Erick Irigoyen is a Solutions Architect at Amazon Web Services focusing on clients in the Semiconductors and Electronics industry. He works closely with customers to understand their business challenges and identify how AWS can be leveraged to achieve their strategic goals. His work has primarily focused on projects related to Artificial Intelligence and Machine Learning (AI/ML). Prior to joining AWS, he was a Senior Consultant at Deloitte’s Advanced Analytics practice where he led workstreams in several engagements across the United States focusing on Analytics and AI/ML. Erick holds a B.S. in Business from the University of San Francisco and an M.S. in Analytics from North Carolina State University.
Acing the Test: NVIDIA Turbocharges Generative AI Training in MLPerf Benchmarks
NVIDIA’s AI platform raised the bar for AI training and high performance computing in the latest MLPerf industry benchmarks.
Among many new records and milestones, one in generative AI stands out: NVIDIA Eos — an AI supercomputer powered by a whopping 10,752 NVIDIA H100 Tensor Core GPUs and NVIDIA Quantum-2 InfiniBand networking — completed a training benchmark based on a GPT-3 model with 175 billion parameters trained on one billion tokens in just 3.9 minutes.
That’s a nearly 3x gain from 10.9 minutes, the record NVIDIA set when the test was introduced less than six months ago.
The benchmark uses a portion of the full GPT-3 data set behind the popular ChatGPT service that, by extrapolation, Eos could now train in just eight days, 73x faster than a prior state-of-the-art system using 512 A100 GPUs.
The acceleration in training time reduces costs, saves energy and speeds time-to-market. It’s heavy lifting that makes large language models widely available so every business can adopt them with tools like NVIDIA NeMo, a framework for customizing LLMs.
In a new generative AI test this round, 1,024 NVIDIA Hopper architecture GPUs completed a training benchmark based on the Stable Diffusion text-to-image model in 2.5 minutes, setting a high bar on this new workload.
By adopting these two tests, MLPerf reinforces its leadership as the industry standard for measuring AI performance, since generative AI is the most transformative technology of our time.
System Scaling Soars
The latest results were due in part to the use of the most accelerators ever applied to an MLPerf benchmark. The 10,752 H100 GPUs far surpassed the scaling in AI training in June, when NVIDIA used 3,584 Hopper GPUs.
The 3x scaling in GPU numbers delivered a 2.8x scaling in performance, a 93% efficiency rate thanks in part to software optimizations.
Efficient scaling is a key requirement in generative AI because LLMs are growing by an order of magnitude every year. The latest results show NVIDIA’s ability to meet this unprecedented challenge for even the world’s largest data centers.
The achievement is thanks to a full-stack platform of innovations in accelerators, systems and software that both Eos and Microsoft Azure used in the latest round.
Eos and Azure both employed 10,752 H100 GPUs in separate submissions. They achieved within 2% of the same performance, demonstrating the efficiency of NVIDIA AI in data center and public-cloud deployments.
NVIDIA relies on Eos for a wide array of critical jobs. It helps advance initiatives like NVIDIA DLSS, AI-powered software for state-of-the-art computer graphics and NVIDIA Research projects like ChipNeMo, generative AI tools that help design next-generation GPUs.
Advances Across Workloads
NVIDIA set several new records in this round in addition to making advances in generative AI.
For example, H100 GPUs were 1.6x faster than in the prior round at training recommender models, which are widely employed to help users find what they’re looking for online. Performance was up 1.8x on RetinaNet, a computer vision model.
These increases came from a combination of advances in software and scaled-up hardware.
NVIDIA was once again the only company to run all MLPerf tests. H100 GPUs demonstrated the fastest performance and the greatest scaling in each of the nine benchmarks.
Speedups translate to faster time to market, lower costs and energy savings for users training massive LLMs or customizing them with frameworks like NeMo for the specific needs of their business.
Eleven systems makers used the NVIDIA AI platform in their submissions this round, including ASUS, Dell Technologies, Fujitsu, GIGABYTE, Lenovo, QCT and Supermicro.
NVIDIA partners participate in MLPerf because they know it’s a valuable tool for customers evaluating AI platforms and vendors.
HPC Benchmarks Expand
In MLPerf HPC, a separate benchmark for AI-assisted simulations on supercomputers, H100 GPUs delivered up to twice the performance of NVIDIA A100 Tensor Core GPUs in the last HPC round. The results showed up to 16x gains since the first MLPerf HPC round in 2019.
The benchmark included a new test that trains OpenFold, a model that predicts the 3D structure of a protein from its sequence of amino acids. OpenFold can do in minutes vital work for healthcare that used to take researchers weeks or months.
Understanding a protein’s structure is key to finding effective drugs fast because most drugs act on proteins, the cellular machinery that helps control many biological processes.
In the MLPerf HPC test, H100 GPUs trained OpenFold in 7.5 minutes. The OpenFold test is a representative part of the entire AlphaFold training process that two years ago took 11 days using 128 accelerators.
A version of the OpenFold model and the software NVIDIA used to train it will be available soon in NVIDIA BioNeMo, a generative AI platform for drug discovery.
Several partners made submissions on the NVIDIA AI platform in this round. They included Dell Technologies and supercomputing centers at Clemson University, the Texas Advanced Computing Center and — with assistance from Hewlett Packard Enterprise (HPE) — Lawrence Berkeley National Laboratory.
Benchmarks With Broad Backing
Since its inception in May 2018, the MLPerf benchmarks have enjoyed broad backing from both industry and academia. Organizations that support them include Amazon, Arm, Baidu, Google, Harvard, HPE, Intel, Lenovo, Meta, Microsoft, NVIDIA, Stanford University and the University of Toronto.
MLPerf tests are transparent and objective, so users can rely on the results to make informed buying decisions.
All the software NVIDIA used is available from the MLPerf repository, so all developers can get the same world-class results. These software optimizations get continuously folded into containers available on NGC, NVIDIA’s software hub for GPU applications.
Learn more about MLPerf and the details of this round.
Research Focus: Week of November 8, 2023
NEW RESEARCH
HMD-NeMo: Online 3D Avatar Motion Generation From Sparse Observations
Generating both plausible and accurate full body avatar motion is essential for creating high quality immersive experiences in mixed reality scenarios. Head-mounted devices (HMDs) typically provide only a few input signals, such as head and hands 6-DoF (the six degrees of freedom of movement of a rigid body in three-dimensional space). Recent approaches have achieved impressive performance in generating full body motion given only head and hand signals. However, all known existing approaches rely on full hand visibility. While this is the case when using motion controllers, for example, a considerable proportion of mixed reality experiences do not involve motion controllers and instead rely on egocentric hand tracking. This introduces the challenge of partial hand visibility, owing to the restricted field of view of the HMD.
In a recent paper: HMD-NeMo: Online 3D Avatar Motion Generation From Sparse Observations, researchers from Microsoft propose HMD-NeMo, the first unified approach that addresses plausible and accurate full body motion generation even when the hands may be only partially visible. HMD-NeMo is a lightweight neural network that predicts full body motion in an online and real-time fashion. At the heart of HMD-NeMo is a spatio-temporal encoder with novel temporally adaptable mask tokens that encourage plausible motion in the absence of hand observations. The researchers perform extensive analysis of the impact of different components in HMD-NeMo and, through their evaluation, introduce a new state-of-the-art on AMASS, a large database of human motion unifying different optical marker-based motion capture datasets.
Microsoft Research Podcast
Collaborators: Renewable energy storage with Bichlien Nguyen and David Kwabi
Dr. Bichlien Nguyen and Dr. David Kwabi explore their work in flow batteries and how machine learning can help more effectively search the vast organic chemistry space to identify compounds with properties just right for storing waterpower and other renewables.
NEW ARTICLE
Will Code Remain a Relevant User Interface for End-User Programming with Generative AI Models?
The research field of end-user programming has largely been concerned with helping non-experts learn to code well enough to achieve their own tasks. Generative AI stands to obviate this entirely by allowing users to generate code from naturalistic language prompts.
In a recent essay: Will Code Remain a Relevant User Interface for End-User Programming with Generative AI Models?, researchers from Microsoft explore the relevance of “traditional” programming languages for non-expert end-user programmers in a world with generative AI. They posit the “generative shift hypothesis”: that generative AI will create qualitative and quantitative expansions in the traditional scope of end-user programming. They outline some reasons that traditional programming languages may still be relevant and useful for end-user programmers, and speculate whether each of these reasons might endure or disappear with further improvements and innovations in generative AI. And finally, they articulate a set of implications for end-user programming research, including the possibility of needing to revisit many well-established core concepts, such as Ko’s learning barriers and Blackwell’s attention investment model.
NEW RESEARCH
LUT-NN: Empower Efficient Neural Network Inference with Centroid Learning and Table Lookup
On-device deep neural network (DNN) inference, widely used in mobile devices such as smartphones and smartwatches, offers unparalleled intelligent services, but also stresses the limited hardware resources on those devices.
In a recent paper: LUT-NN: Empower Efficient Neural Network Inference with Centroid Learning and Table Lookup, researchers at Microsoft propose a system that reduces latency and consumes less memory, disk, and power for more efficient DNN inference. LUT-NN learns the typical features for each operator, known as centroids, and precomputes the results for these centroids to save in lookup tables. During inference, the results for the centroids closest to the inputs can be read directly from the table as the approximated outputs, without computation.
LUT-NN integrates two major novel techniques: (1) differentiable centroid learning through backpropagation, which adapts three levels of approximation to minimize the accuracy impact by centroids; (2) table lookup inference execution, which comprehensively considers different levels of parallelism, memory access reduction, and dedicated hardware units for optimal performance.
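To make the centroid-plus-lookup idea concrete, the following is a generic sketch in the spirit of product quantization, not the paper’s implementation: a linear layer y = x @ W is approximated by snapping each sub-vector of x to its nearest centroid and summing precomputed partial results from a table (the centroids here are random stand-ins for learned ones):
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_sub, n_centroids = 64, 32, 8, 16
sub_dim = d_in // n_sub

W = rng.standard_normal((d_in, d_out))
# "Learned" centroids per sub-space (random stand-ins for trained centroids)
centroids = rng.standard_normal((n_sub, n_centroids, sub_dim))

# Precompute the lookup table: the contribution of each centroid to the output
table = np.einsum("scd,sdo->sco", centroids, W.reshape(n_sub, sub_dim, d_out))

def lut_linear(x):
    # Split the input into sub-vectors and find the nearest centroid in each sub-space
    x_sub = x.reshape(n_sub, sub_dim)
    dists = ((centroids - x_sub[:, None, :]) ** 2).sum(-1)  # (n_sub, n_centroids)
    idx = dists.argmin(-1)                                   # nearest centroid per sub-space
    # Sum the precomputed partial results instead of computing the full matmul
    return table[np.arange(n_sub), idx].sum(0)

x = rng.standard_normal(d_in)
approx, exact = lut_linear(x), x @ W
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))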