Abstracts: December 12, 2023

Microsoft Research Podcast: Abstracts

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Senior Principal Research Manager Tao Qin and Senior Researcher Lijun Wu discuss “FABind: Fast and Accurate Protein-Ligand Binding.” The paper, accepted at the 2023 Conference on Neural Information Processing Systems (NeurIPS), introduces a new method for predicting the binding structures of proteins and ligands during drug development. The method demonstrates improved speed and accuracy over current methods.

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today, I’m talking to Dr. Tao Qin, a Senior Principal Research Manager, and Dr. Lijun Wu, a Senior Researcher, both from Microsoft Research. Drs. Qin and Wu are coauthors of a paper titled “FABind: Fast and Accurate Protein-Ligand Binding,” and this paper—which was accepted for the 2023 Conference on Neural Information Processing Systems, or NeurIPS—is available now on arXiv. Tao Qin, Lijun Wu, thanks for joining us on Abstracts.

LIJUN WU: Thanks. 

TAO QIN: Yeah, thank you. Yeah, it’s great to be here and to share our latest research. 

HUIZINGA: So, Tao, let’s start off with you. In a couple sentences, tell us what issue or problem your research addresses and, more importantly, why people should care about it.


QIN: Yeah, uh, we work on the problem of molecular docking, a computational modeling method used to predict the preferred orientation of one molecule when it binds to a second molecule to form a stable complex. So it aims to predict the binding pose of a ligand in the active site of a receptor and estimate the ligand-receptor binding affinity. This problem is very important for drug discovery and development. Accurately predicting binding poses can provide insights into how a drug candidate might bind to its biological target and whether it is likely to have the desired therapeutic effect. To make an analogy, just like a locker and a key, protein target is a locker, while the ligand is a key. We should carefully design the structure of the key so that it can perfectly fit into the locker. Similarly, the molecular structure should be accurately constructed so that the protein can be well bonded. Then the protein function would be activated or inhibited. Molecular docking is used intensively in the early stages of drug design and discovery to screen a large library of hundreds of thousands of compounds to identify promising lead compounds. It helps eliminate poor candidates and focus on experimental results of those most likely to bind to the target protein well. So clearly, improving the accuracy and also the speed of docking methods, like what we have done in this work, could accelerate the development of new life-saving drugs. 

HUIZINGA: So, Lijun, tell us how your approach builds on and/or differs from what’s been done previously in this field. 

WU: Sure, thanks, yeah. So conventional protein-ligand docking methods, they usually take the sampling and scoring ways. So … which … that means, they will use first some sampling methods to generate multiple protein-ligand docking poses as candidates. And then we will use some scoring functions to evaluate these candidates and select from them and to choose the best ones. So such as DiffDock, a very recent work developed by MIT, which is a very strong model to use the diffusion algorithm to do the sampling in this kind of way. And this kind of method, I say the sampling and scoring methods, they are accurate with good predictions, but of course, they are very slow. So this is a very big limitation because the sampling process usually takes a lot of time. So some other methods such as EquiBind or TANKBind, they treat the docking prediction as a regression task, which is to use deep networks to directly predict the coordinates of the atoms in the molecule. Obviously, this kind of method is much faster than the sampling methods, but the prediction accuracy is usually worse. So therefore, our FABind, which … aims to provide a both fast and accurate method for the docking problem. FABind keeps its fast prediction by modeling in a regression way, and also, we utilize some novel designs to improve its prediction accuracy. 

HUIZINGA: So, Lijun, let’s stay with you for a minute. Regarding your research strategy on this, uh, how would you describe your methodology, and how did you go about conducting this research? 

WU: OK, sure. So when we’re talking about the detailed method, we actually build an end-to-end deep learning framework, FABind, here. So for the protein-ligand docking, FABind divides the docking task as a pocket prediction process and also a pose prediction process. But importantly, we unify these two processes within a single deep learning model, which is a very novel equivariant graph neural network. Here, the pocket means a local part of the whole protein, which are some specific amino acids that can bind to the molecule in the structure space. So simply speaking, this novel graph neural network is stacked by some identity graph neural networks. And the graph neural layer is carefully designed by us, and we use the first graph layer for the pocket prediction and the later layers to do the pose prediction. And for each layer, there are some message passing operations we designed. The first one is an independent message passing, which is to update the information within the protein molecule itself. And the second one is the cross-attention message passing, which is to update the information between the whole protein and also the whole molecule so we can then let each other have a global view. And the last one is an interfacial message passing, which is to do the update, and we can message pass the information between the close nodes between the protein and the molecule. So besides, there are also some small points that will help to get an accurate docking model. For example, we use a scheduled training technique to bridge the gap between the training and the inference stages. And also, we combine direct coordinate prediction and also the distance map refinement as our optimization method.

HUIZINGA: Well, listen, I want to stay with you even more because you’re talking about the technical specifications of your research methodology. Let’s talk about results. What were your major findings on the performance of FABind?

WU: Yeah, the results are very promising. So first we need to care about the docking performance, which is the accuracy of the, uh, docking pose prediction. We compare our FABind to different baselines such as EquiBind, TANKBind, and also, I talked before about the recent strong model DiffDock, developed by MIT. So the results showed that our docking prediction accuracy are very good. They achieve a very competitive performance to the DiffDock like that. But specifically, we need to talk about that the speed is very important. When compared to DiffDock, we achieved about 170 times faster speed than DiffDock. So this is very promising. Besides, the interesting thing is that we found our FABind can achieve very, very strong performance on the unseen protein targets, which means that the protein structure that we have never seen before during the training, we can achieve very good performance. So our FABind achieves significantly better performance with about 10 percent to 40 percent accuracy improvement than DiffDock. This performance demonstrates that the practical effectiveness of our work is very promising since such kinds of new proteins are the most important ones that we need to care for a new disease. 

HUIZINGA: Tao, this is all fascinating, but talk about real-world significance for this work. Who does it help most and how? 

QIN: Yeah. As Lijun has introduced, FABind significantly outperforms earlier methods in terms of speed while maintaining competitive accuracy. This fast prediction capability is extremely important in real-world applications, where high-throughput virtual screening for compound selection is often required for drug discovery. So an efficient virtual screening process can significantly accelerate the drug discovery process. Furthermore, our method demonstrates great performance on unseen or new proteins, which indicates that our FABind possesses a strong generalization ability. This is very important. Consider the case of SARS-CoV-2, for example, where our knowledge of the protein target is very limited at the beginning of the pandemic. So if we have a robust docking model that can generalize to new proteins, we could conduct a large-scale virtual screening and, uh, confidently select potentially effective ligands. This would greatly speed up the development of new treatments. 

HUIZINGA: So downstream from the drug discovery science, benefits would accrue to people who have diseases and need treatment for those things. 

QIN: Yes, exactly. 

HUIZINGA: OK, well, Tao, let’s get an elevator pitch in here, sort of one takeaway, a golden nugget, uh, that you’d like our listeners to take away from this work. If, if there was one thing you wanted them to take away from the work, what would it be? 

QIN: Yeah, uh, thanks for a great question. So I think one sentence for takeaway is that if for some researchers, they are utilizing molecular docking and they are seeking an AI-based approach, our FABind method definitely should be in their consideration list, especially considering the exceptional predictive accuracy and the high computational efficiency of our method.

HUIZINGA: Finally, Tao, what are the big questions and problems that remain in this area, and what’s next on your research agenda? 

QIN: Actually, there are multiple unaddressed questions along this direction, so I think those are all opportunities for further exploration. So here I just give three examples. First, our method currently tackles rigid docking, where the target protein structure is assumed to be fixed, leaving only the ligand structure to be predicted. However, in a more realistic scenario, the protein is dynamic during molecular binding. So therefore, exploring flexible docking becomes an essential aspect. Second, our approach assumes that the target protein has only one binding pocket. In reality, a target protein may have multiple binding pockets. So this situation will be more challenging. So how to address such kind of significant challenge is worth exploration. Third, in the field of drug design, sometimes we need to find a target or we need to find a drug compound that can bind with multiple target proteins. In this work, we only consider a single target protein. So the accurate prediction of docking for multiple target proteins poses a great challenge. 

HUIZINGA: Well, Tao Qin and Lijun Wu, thank you for joining us today. And to our listeners, thanks for tuning in.  

[MUSIC PLAYS] 

If you’re interested in learning more about this work, you can find a link to the paper at aka.ms/abstracts or you can find it on arXiv. See you next time on Abstracts.

[MUSIC FADES]

Read More

Create a web UI to interact with LLMs using Amazon SageMaker JumpStart

The launch of ChatGPT and the rise in popularity of generative AI have captured the imagination of customers who are curious about how they can use this technology to create new products and services on AWS, such as enterprise chatbots, which are more conversational. This post shows you how you can create a web UI, which we call Chat Studio, to start a conversation and interact with foundation models available in Amazon SageMaker JumpStart such as Llama 2, Stable Diffusion, and other models available on Amazon SageMaker. After you deploy this solution, users can get started quickly and experience the capabilities of multiple foundation models in conversational AI through a web interface.

Chat Studio can also optionally invoke the Stable Diffusion model endpoint to return a collage of relevant images and videos if the user requests media to be displayed. This feature can help enhance the user experience by using media as accompanying assets to the response. This is just one example of how you can enrich Chat Studio with additional integrations to meet your goals.

The following screenshots show examples of what a user query and response look like.

Chat Studio query interface

Chat Studio response interface

Large language models

Generative AI chatbots such as ChatGPT are powered by large language models (LLMs), which are based on a deep learning neural network that can be trained on large quantities of unlabeled text. The use of LLMs allows for a better conversational experience that closely resembles interactions with real humans, fostering a sense of connection and improved user satisfaction.

SageMaker foundation models

In 2021, the Stanford Institute for Human-Centered Artificial Intelligence termed some LLMs as foundation models. Foundation models are pre-trained on a large and broad set of general data and are meant to serve as the foundation for further optimizations in a wide range of use cases, from generating digital art to multilingual text classification. These foundation models are popular with customers because training a new model from scratch takes time and can be expensive. SageMaker JumpStart provides access to hundreds of foundation models maintained by third-party open-source and proprietary providers.

Solution overview

This post walks through a low-code workflow for deploying pre-trained and custom LLMs through SageMaker, and creating a web UI to interface with the models deployed. We cover the following steps:

  1. Deploy SageMaker foundation models.
  2. Deploy AWS Lambda and AWS Identity and Access Management (IAM) permissions using AWS CloudFormation.
  3. Set up and run the user interface.
  4. Optionally, add other SageMaker foundation models. This step extends Chat Studio’s capability to interact with additional foundation models.
  5. Optionally, deploy the application using AWS Amplify. This step deploys Chat Studio to the web.

Refer to the following diagram for an overview of the solution architecture.

Chat Studio Solution Architecture

Prerequisites

To walk through the solution, you must have the following prerequisites:

  • An AWS account with sufficient IAM user privileges.
  • npm installed in your local environment. For instructions on how to install npm, refer to Downloading and installing Node.js and npm.
  • A service quota of 1 for the corresponding SageMaker endpoints. For Llama 2 13b Chat, we use an ml.g5.48xlarge instance and for Stable Diffusion 2.1, we use an ml.p3.2xlarge instance.

To request a service quota increase, on the AWS Service Quotas console, navigate to AWS services, SageMaker, and request a quota increase to a value of 1 for ml.g5.48xlarge for endpoint usage and ml.p3.2xlarge for endpoint usage.

The service quota request may take a few hours to be approved, depending on the instance type availability.
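
If you prefer to check and request these quotas programmatically rather than through the console, the following minimal sketch uses boto3 and the Service Quotas API. The quota names are assumptions based on how they appear in the Service Quotas console; verify them in the output of list_service_quotas before submitting a request.

# Minimal sketch: look up and request the two SageMaker endpoint quotas from code.
import boto3

sq = boto3.client("service-quotas")

# Quota names assumed from the Service Quotas console wording above.
wanted = {"ml.g5.48xlarge for endpoint usage", "ml.p3.2xlarge for endpoint usage"}
paginator = sq.get_paginator("list_service_quotas")

for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if quota["QuotaName"] in wanted and quota["Value"] < 1:
            sq.request_service_quota_increase(
                ServiceCode="sagemaker",
                QuotaCode=quota["QuotaCode"],
                DesiredValue=1.0,
            )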

Deploy SageMaker foundation models

SageMaker is a fully managed machine learning (ML) service that helps developers quickly build and train ML models. Complete the following steps to deploy the Llama 2 13b Chat and Stable Diffusion 2.1 foundation models using Amazon SageMaker Studio:

  1. Create a SageMaker domain. For instructions, refer to Onboard to Amazon SageMaker Domain using Quick setup.

A domain sets up all the storage and allows you to add users to access SageMaker.

  2. On the SageMaker console, choose Studio in the navigation pane, then choose Open Studio.
  3. Upon launching Studio, under SageMaker JumpStart in the navigation pane, choose Models, notebooks, solutions.
    SageMaker JumpStart Console
  4. In the search bar, search for Llama 2 13b Chat.
  5. Under Deployment Configuration, for SageMaker hosting instance, choose ml.g5.48xlarge and for Endpoint name, enter meta-textgeneration-llama-2-13b-f.
  6. Choose Deploy.

SageMaker JumpStart Deployment Configuration

After the deployment succeeds, you should be able to see the In Service status.

Llama Model Status

  7. On the Models, notebooks, solutions page, search for Stable Diffusion 2.1.
  8. Under Deployment Configuration, for SageMaker hosting instance, choose ml.p3.2xlarge and for Endpoint name, enter jumpstart-dft-stable-diffusion-v2-1-base.
  9. Choose Deploy.

SageMaker JumpStart Deployment Configuration

After the deployment succeeds, you should be able to see the In Service status.

Stable Diffusion Model Status
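
The console steps above can also be performed programmatically. The following is a minimal sketch using the SageMaker Python SDK; the JumpStart model IDs are assumptions inferred from the endpoint names used in this post, so confirm them in the JumpStart model catalog before deploying.

# Minimal sketch: deploy the two JumpStart models with the SageMaker Python SDK
# instead of the Studio UI. Requires AWS credentials and a SageMaker execution role.
from sagemaker.jumpstart.model import JumpStartModel

# Llama 2 13b Chat (model ID assumed from the endpoint name above)
llama = JumpStartModel(model_id="meta-textgeneration-llama-2-13b-f")
llama.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.48xlarge",
    endpoint_name="meta-textgeneration-llama-2-13b-f",
    accept_eula=True,  # Llama 2 requires accepting the Meta end-user license agreement
)

# Stable Diffusion 2.1 base (model ID assumed)
sd = JumpStartModel(model_id="model-txt2img-stabilityai-stable-diffusion-v2-1-base")
sd.deploy(
    initial_instance_count=1,
    instance_type="ml.p3.2xlarge",
    endpoint_name="jumpstart-dft-stable-diffusion-v2-1-base",
)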

Deploy Lambda and IAM permissions using AWS CloudFormation

This section describes how you can launch a CloudFormation stack that deploys a Lambda function that processes your user request and calls the SageMaker endpoint that you deployed, and deploys all the necessary IAM permissions. Complete the following steps:

  1. Navigate to the GitHub repository and download the CloudFormation template (lambda.cfn.yaml) to your local machine.
  2. On the CloudFormation console, choose the Create stack drop-down menu and choose With new resources (standard).
  3. On the Specify template page, select Upload a template file and Choose file.
  4. Choose the lambda.cfn.yaml file that you downloaded, then choose Next.
  5. On the Specify stack details page, enter a stack name and the API key that you obtained in the prerequisites, then choose Next.
  6. On the Configure stack options page, choose Next.
  7. Review and acknowledge the changes and choose Submit.
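
Alternatively, if you prefer to create the stack from code rather than the console, a minimal boto3 sketch might look like the following. The parameter key for the API key is a placeholder; use the parameter name defined in lambda.cfn.yaml.

# Minimal sketch: create the Chat Studio Lambda stack with boto3 instead of the console.
import boto3

cfn = boto3.client("cloudformation")

with open("lambda.cfn.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="chat-studio-lambda",
    TemplateBody=template_body,
    # "ApiKey" is a placeholder parameter name -- check the template for the real one.
    Parameters=[{"ParameterKey": "ApiKey", "ParameterValue": "<your-api-key>"}],
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],  # the stack creates IAM resources
)

# Block until the stack reaches CREATE_COMPLETE before moving on.
cfn.get_waiter("stack_create_complete").wait(StackName="chat-studio-lambda")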

Set up the web UI

This section describes the steps to run the web UI (created using Cloudscape Design System) on your local machine:

  1. On the IAM console, navigate to the user functionUrl.
  2. On the Security Credentials tab, choose Create access key.
  3. On the Access key best practices & alternatives page, select Command Line Interface (CLI) and choose Next.
  4. On the Set description tag page, choose Create access key.
  5. Copy the access key and secret access key.
  6. Choose Done.
  7. Navigate to the GitHub repository and download the react-llm-chat-studio code.
  8. Launch the folder in your preferred IDE and open a terminal.
  9. Navigate to src/configs/aws.json and input the access key and secret access key you obtained.
  10. Enter the following commands in the terminal:
    npm install
    
    npm start

  11. Open http://localhost:3000 in your browser and start interacting with your models!

To use Chat Studio, choose a foundation model from the drop-down menu and enter your query in the text box. To get AI-generated images along with the response, add the phrase “with images” to the end of your query.

Add other SageMaker foundation models

You can further extend the capability of this solution to include additional SageMaker foundation models. Because every model expects different input and output formats when invoking its SageMaker endpoint, you will need to write some transformation code in the callSageMakerEndpoints Lambda function to interface with the model.

This section describes the general steps and code changes required to implement an additional model of your choice. Note that basic knowledge of the Python language is required for Steps 6–8.

  1. In SageMaker Studio, deploy the SageMaker foundation model of your choice.
  2. Choose SageMaker JumpStart and Launch JumpStart assets.
  3. Choose your newly deployed model endpoint and choose Open Notebook.
  4. On the notebook console, find the payload parameters.

These are the fields that the new model expects when invoking its SageMaker endpoint. The following screenshot shows an example.

SageMaker Endpoint Configuration

  5. On the Lambda console, navigate to callSageMakerEndpoints.
  6. Add a custom input handler for your new model.

In the following screenshot, we transformed the input for Falcon 40B Instruct BF16 and GPT NeoXT Chat Base 20B FP16. You can insert your custom parameter logic as indicated to add the input transformation logic with reference to the payload parameters that you copied.

Lambda Code Snippet
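
As a rough illustration of what such an input handler can look like, the following hypothetical sketch builds a payload for a TGI-style text-generation model such as Falcon 40B Instruct and invokes its endpoint. The field names ("inputs", "parameters", "max_new_tokens", and so on) and the endpoint name are assumptions; copy the actual payload parameters you found in the example notebook in Step 4.

import json
import boto3

def build_falcon_payload(user_prompt: str) -> bytes:
    # Hypothetical TGI-style payload; replace the field names with the ones from
    # the model's example notebook.
    payload = {
        "inputs": user_prompt,
        "parameters": {
            "max_new_tokens": 256,
            "temperature": 0.7,
            "return_full_text": False,
        },
    }
    return json.dumps(payload).encode("utf-8")

# Example invocation from within the Lambda function.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="your-falcon-endpoint-name",  # placeholder endpoint name
    ContentType="application/json",
    Body=build_falcon_payload("What is Amazon SageMaker JumpStart?"),
)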

  7. Return to the notebook console and locate query_endpoint.

This function gives you an idea how to transform the output of the models to extract the final text response.

SageMaker Endpoint Configuration
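
As a companion to the input handler sketch above, a hypothetical output handler for the same assumed TGI-style response format might look like the following. The response shape is an assumption; mirror whatever query_endpoint does in the model's example notebook.

import json

def parse_falcon_response(response) -> str:
    # response is the dict returned by the sagemaker-runtime invoke_endpoint call.
    body = json.loads(response["Body"].read())
    # Assumed TGI-style shape: [{"generated_text": "..."}]
    if isinstance(body, list) and body and "generated_text" in body[0]:
        return body[0]["generated_text"]
    # Fall back to the raw body if the model returns a different format.
    return json.dumps(body)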

  8. With reference to the code in query_endpoint, add a custom output handler for your new model.
    Lambda Code
  9. Choose Deploy.
  10. Open your IDE, launch the react-llm-chat-studio code, and navigate to src/configs/models.json.
  11. Add your model name and model endpoint, and enter the payload parameters from Step 4 under payload using the following format:
    "add_model_name": {
    "endpoint_name": "add_model_endpoint",
    "payload": {
    "add_payload_parameters_here"
    }
    },

  12. Refresh your browser to start interacting with your new model!

Deploy the application using Amplify

Amplify is a complete solution that allows you to quickly and efficiently deploy your application. This section describes the steps to deploy Chat Studio to an Amazon CloudFront distribution using Amplify if you wish to share your application with other users.

  1. Navigate to the react-llm-chat-studio code folder you created earlier.
  2. Enter the following commands in the terminal and follow the setup instructions:
    npm install -g @aws-amplify/cli
    
    amplify configure

  3. Initialize a new Amplify project by using the following command. Provide a project name, accept the default configurations, and choose AWS access keys when prompted to select the authentication method.
    amplify init

  4. Host the Amplify project by using the following command. Choose Amazon CloudFront and S3 when prompted to select the plugin mode.
    amplify hosting add

  5. Finally, build and deploy the project with the following command:
    amplify publish

  6. After the deployment succeeds, open the URL provided in your browser and start interacting with your models!

Clean up

To avoid incurring future charges, complete the following steps:

  1. Delete the CloudFormation stack. For instructions, refer to Deleting a stack on the AWS CloudFormation console.
  2. Delete the SageMaker JumpStart endpoint. For instructions, refer to Delete Endpoints and Resources.
  3. Delete the SageMaker domain. For instructions, refer to Delete an Amazon SageMaker Domain.

Conclusion

In this post, we explained how to create a web UI for interfacing with LLMs deployed on AWS.

With this solution, you can interact with your LLM and hold a conversation in a user-friendly manner to test or ask the LLM questions, and get a collage of images and videos if required.

You can extend this solution in various ways, such as to integrate additional foundation models, integrate with Amazon Kendra to enable ML-powered intelligent search for understanding enterprise content, and more!

We invite you to experiment with different pre-trained LLMs available on AWS, or build on top of or even create your own LLMs in SageMaker. Let us know your questions and findings in the comments, and have fun!


About the authors

Jarrett Yeo Shan Wei is an Associate Cloud Architect in AWS Professional Services covering the Public Sector across ASEAN and is an advocate for helping customers modernize and migrate into the cloud. He has attained five AWS certifications, and has also published a research paper on gradient boosting machine ensembles in the 8th International Conference on AI. In his free time, Jarrett focuses on and contributes to the generative AI scene at AWS.

Tammy Lim Lee Xin is an Associate Cloud Architect at AWS. She uses technology to help customers deliver their desired outcomes in their cloud adoption journey and is passionate about AI/ML. Outside of work she loves travelling, hiking, and spending time with family and friends.

Read More

Frugality meets Accuracy: Cost-efficient training of GPT NeoX and Pythia models with AWS Trainium

Large language models (or LLMs) have become a topic of daily conversations. Their quick adoption is evident in the amount of time required to reach 100 million users, which has gone from 4.5 years for Facebook to an all-time low of a mere 2 months for ChatGPT. A generative pre-trained transformer (GPT) uses causal autoregressive updates to make predictions. These model architectures have demonstrated stupendous performance on a variety of tasks such as speech recognition, text generation, and question answering. Several recent models such as NeoX, Falcon, and Llama use the GPT architecture as a backbone. Training LLMs requires a colossal amount of compute time, which costs millions of dollars. In this post, we’ll summarize the training procedure of GPT NeoX on AWS Trainium, a purpose-built machine learning (ML) accelerator optimized for deep learning training. We’ll outline how we cost-effectively (3.2 M tokens/$) trained such models with AWS Trainium without losing any model quality.

Solution overview

GPT NeoX and Pythia models

GPT NeoX and Pythia are open-source causal language models from Eleuther-AI, with approximately 20 billion parameters in NeoX and 6.9 billion in Pythia. Both are decoder models following a similar architectural design to GPT-3. However, they also have several additions that are widely adopted in recent models such as Llama. In particular, they use rotary positional embedding (RoPE) with partial rotation across the head dimensions. The original models (NeoX and Pythia 6.9B) were trained on the openly available Pile dataset with deduplication, using the Megatron and DeepSpeed backends.

We demonstrate the pre-training and fine-tuning of these models on AWS Trainium-based Trn1 instances using the Neuron NeMo library. To establish the proof of concept and enable quick reproduction, we’ll use a smaller Wikipedia dataset subset tokenized using the GPT2 byte-pair encoding (BPE) tokenizer.

Walkthrough

Download the pre-tokenized Wikipedia dataset as shown:

export DATA_DIR=~/examples_datasets/gpt2

mkdir -p ${DATA_DIR} && cd ${DATA_DIR}

wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.bin . --no-sign-request
aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.idx . --no-sign-request
aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/license.txt . --no-sign-request

Both NeoX 20B and Pythia 6.9B use RoPE with partial rotation, for example, rotating 25% of the head dimensions and keeping the rest unrotated. To efficiently implement the partial rotation on the AWS Trainium accelerator, instead of concatenating the rotating and non-rotating dimensions, we append zero frequencies for the non-rotating dimensions and then rotate the complete set of head dimensions. This simple trick helped us improve the throughput (sequences processed per second) on AWS Trainium.
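
To make this trick concrete, here is a minimal PyTorch sketch of partial RoPE implemented through zero-frequency padding. It is an illustration of the idea only; the pairing convention and the actual Neuron NeMo kernel may differ.

import torch

def build_partial_rope_cache(seq_len, head_dim, rotary_pct=0.25, base=10000.0):
    # Real frequencies for the rotated fraction of the head dimension.
    rotary_dim = int(head_dim * rotary_pct)
    inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2).float() / rotary_dim))
    # Zero frequency means cos(0)=1 and sin(0)=0, so the non-rotated dimensions
    # pass through the rotation below completely unchanged.
    inv_freq = torch.cat([inv_freq, torch.zeros((head_dim - rotary_dim) // 2)])
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # [seq_len, head_dim/2]
    return angles.cos(), angles.sin()

def apply_rope(x, cos, sin):
    # x: [batch, seq_len, n_heads, head_dim]; rotates interleaved (even, odd) pairs.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = cos[None, :, None, :], sin[None, :, None, :]
    rotated = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return rotated.flatten(-2)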

Training steps

To run the training, we use a SLURM-managed multi-node Amazon Elastic Compute Cloud (Amazon EC2) Trn1 cluster, with each node containing a trn1.32xl instance. Each trn1.32xl has 16 accelerators with two workers per accelerator. After downloading the latest Neuron NeMo package, use the provided neox and pythia pre-training and fine-tuning scripts with optimized hyper-parameters and execute the following for a four-node training.

  1. Compile: Pre-compile the model with three train iterations to generate and save the graphs:
    sbatch --nodes 4 compile.slurm ./neoX_20B_slurm.sh

  2. Run: Execute the training by loading the cached graphs from the first step:
    sbatch --nodes 4 run.slurm ./neoX_20B_slurm.sh

  3. Monitor results
    tensorboard --logdir=nemo_experiments/megatron_neox

The same steps need to be followed to run the Pythia 6.9B model, replacing neoX_20B_slurm.sh with pythia_6.9B_slurm.sh.

Pre-training and fine-tuning experiments

We demonstrate the pre-training of the GPT NeoX and Pythia models on AWS Trainium using the Neuron NeMo library for 10k iterations, and also show fine-tuning of these models for 1k steps. For pre-training, we use the GPT2 BPE tokenizer inside NeMo and follow the same config as used in the original model. Fine-tuning on AWS Trainium requires changing a few parameters (such as the vocab size division factor), which are provided in the fine-tuning scripts to account for Megatron versus NeMo differences and GPU versus AWS Trainium changes. The multi-node distributed training throughput with a varying number of nodes is shown in Table 1.

Model | Tensor Parallel | Pipeline Parallel | Number of instances | Cost ($/hour) | Sequence length | Global batch size | Throughput (seq/sec) | Cost-throughput ratio (tokens/$)
Pythia 6.9B | 8 | 1 | 1 | 7.59 | 2048 | 256 | 10.4 | 10,102,387
Pythia 6.9B | 8 | 1 | 4 | 30.36 | 2048 | 256 | 35.8 | 8,693,881
NeoX 20B | 8 | 4 | 4 | 30.36 | 2048 | 16384 | 13.60 | 3,302,704
NeoX 20B | 8 | 4 | 8 | 60.72 | 2048 | 16384 | 26.80 | 3,254,134
NeoX 20B | 8 | 4 | 16 | 121.44 | 2048 | 16384 | 54.30 | 3,296,632
NeoX 20B | 8 | 4 | 32 | 242.88 | 2048 | 16384 | 107.50 | 3,263,241
NeoX 20B | 8 | 4 | 64 | 485.76 | 2048 | 16384 | 212.00 | 3,217,708

Table 1. Comparing mean throughput of GPT NeoX and Pythia models for training up to 500 steps with changing number of nodes. The pricing of trn1.32xl is based on the 3-year reserved effective per hour rate.

Next, we also evaluate the loss trajectory of the model training on AWS Trainium and compare it with the corresponding run on a P4d (Nvidia A100 GPU cores) cluster. Along with the training loss, we also compare a useful indicator such as the gradient norm, which is the 2-norm of the model gradients computed at each training iteration to monitor the training progress. The training results are shown in Figures 1 and 2, and the fine-tuning of NeoX 20B in Figure 3.

Figure 1. Training loss averaged across all workers (left) and gradient norm (right) at each training step. NeoX 20B is trained on 4 nodes with a small wiki dataset on GPU and Trainium with the same training hyper-parameters (global batch size=256). The GPU uses BF16 and default mixed precision, while AWS Trainium uses full BF16 with stochastic rounding. The loss and gradient norm trajectories match for GPU and AWS Trainium.

Figure 2. Training loss averaged across all workers (left) and gradient norm (right) at each training step. Similar to GPT NeoX in Figure 1, Pythia 6.9B is trained on 4 nodes with a small wiki dataset on GPU and Trainium with the same training hyper-parameters (global batch size=256). The loss and gradient norm trajectories match for GPU and Trainium.

Figure 3. Fine-tuning the GPT NeoX 20B model on GPU and AWS Trainium with training loss averaged across all workers (left) and gradient norm (right). A small wiki dataset is used for the fine-tuning demonstration. The loss and gradient norm trajectories match for GPU and AWS Trainium.

In this post, we showed cost-efficient training of LLMs on AWS deep learning hardware. We trained the GPT NeoX 20B and Pythia 6.9B models on AWS Trn1 with the Neuron NeMo library. The cost-normalized throughput for the 20-billion-parameter model with AWS Trainium is approximately 3.2M tokens per dollar spent. Along with cost-efficient training on AWS Trainium, we obtain similar model accuracy, which is evident from the training step loss and gradient norm trajectories. We also fine-tuned the available checkpoints for the NeoX 20B model on AWS Trainium. For additional information on distributed training with NeMo Megatron on AWS Trainium, see AWS Neuron Reference for NeMo Megatron. A good resource to start fine-tuning the Llama model can be found here: Llama2 fine-tuning. To get started with managed AWS Trainium on Amazon SageMaker, see Train your ML Models with AWS Trainium and Amazon SageMaker.


About the Authors

Gaurav Gupta is currently an Applied Scientist at Amazon Web Services (AWS) AI labs. Dr. Gupta completed his PhD at USC Viterbi. His research interests span the domain of sequential data modeling, learning partial differential equations, information theory for machine learning, fractional dynamical models, and complex networks. He is currently working on applied and mathematical problems in LLM training behavior, vision models with PDEs, and information-theoretic multi-modality models. Dr. Gupta has publications in top journals and conferences such as NeurIPS, ICLR, ICML, Nature, IEEE Control Society, and ACM Cyber-Physical Society.

Ben Snyder is an applied scientist with AWS Deep Learning. His research interests include foundational models, reinforcement learning, and asynchronous optimization. Outside of work, he enjoys cycling and backcountry camping.

Amith (R) Mamidala is a senior machine learning application engineer at AWS Annapurna Labs. Dr. Mamidala completed his PhD at the Ohio State University in high performance computing and communication. During his tenure at IBM research, Dr. Mamidala contributed towards the BlueGene class of computers, which often led the Top500 ranking of the most powerful and power-efficient supercomputers. The project was awarded the 2009 National Medal of Technology and Innovation. After a brief stint as an AI engineer at a financial hedge fund, Dr. Mamidala joined Annapurna Labs, focusing on large language model training.

Jun (Luke) Huan is a principal scientist at AWS AI Labs. Dr. Huan works on AI and Data Science. He has published more than 180 peer-reviewed papers in leading conferences and journals. He was a recipient of the NSF Faculty Early Career Development Award in 2009. Before joining AWS, he worked at Baidu research as a distinguished scientist and the head of Baidu Big Data Laboratory. He founded StylingAI Inc., an AI start-up, and worked as the CEO and Chief Scientist in 2019-2021. Before joining industry, he was the Charles E. and Mary Jane Spahr Professor in the EECS Department at the University of Kansas.

Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.

Read More

Vodafone advances its machine learning skills with AWS DeepRacer and Accenture

Vodafone is transitioning from a telecommunications company (telco) to a technology company (TechCo) by 2025, with objectives of innovating faster, reducing costs, improving security, and simplifying operations. Thousands of engineers are being onboarded to contribute to this transition. By 2025, Vodafone plans to have 50% of its global workforce actively involved in software development, with an objective to deliver 60% of digital services in-house. This new workforce requires rapid reskilling and understanding of disruptive services such as artificial intelligence (AI) and machine learning (ML) to drive meaningful outcomes.

To help achieve this ambitious transition, Vodafone has partnered with Accenture and AWS to build a cloud platform that helps its engineers work in flexible, creative, and agile ways by providing them a curated set of managed, security and DevOps-oriented AWS services and application workloads. To learn more, check out Redefining Vodafone’s customer experience with AWS and the following talk at AWS re:Invent 2022.

Vodafone Digital engineering (VDE) invited Accenture and AWS to co-host an exclusive event at their annual DigiFest, a week-long event celebrating the scale of their global VDE teams, championing reusable apps and collaborative idea generation. As one of the main events of the DigiFest, AWS and Accenture conceptualized a company-wide AWS DeepRacer challenge where engineers can build and train their models to become better versed in using ML with AWS.

In this post, we share how Vodafone is advancing its ML skills using AWS DeepRacer and Accenture.

Why is machine learning important to Vodafone?

Machine learning is one of the fastest growing domains in technology and telecommunications, owing to the benefits of improved productivity and forecasting across key domains in telecommunications such as channels, CRM, billing, order management, service assurance, network management, and more.

Vodafone has already adopted ML in the proactive detection and correction of network anomalies to improve customer satisfaction. Their AI and ML capabilities in digital self-care, via a chatbot, have been helping their customer care team focus on cases that need deeper attention. Because they use AWS for providing digital services packaged as telco as a service, incorporating AI and ML components is crucial to maintain a competitive edge in delivering state-of-the-art services to customers.

Why AWS DeepRacer?

AWS DeepRacer is an interesting and fun way to get started with reinforcement learning (RL). RL is an advanced ML technique that takes a very different approach to training models than other ML methods. Its super power is that it learns very complex behaviors without requiring any labeled training data, and can make short-term decisions while optimizing for a longer-term goal. The AWS DeepRacer Challenge provided an opportunity for Vodafone’s engineers to engage in a friendly competition, develop an ML mindset, and share insights on how to succeed in a private virtual racing event.

Racing with AWS DeepRacer

The event played out in three stages, starting with a workshop on AWS DeepRacer to cover the basics of reinforcement learning, which was attended by over 225 Vodafone engineers. They learned how to fine-tune an AWS DeepRacer model by creating a reward function, exploring the action space, systematically tuning hyperparameters, examining the training job progress, evaluating the model, and testing the model on a virtual AWS DeepRacer vehicle and virtual track.

In the next stage, a league race was organized where 130 racers were able to view the race videos of the best model submission of every participant on a live leaderboard. This helped them understand how a high-performance model performs after it’s trained. They quickly understood that overtraining occurs when a model is trained for too long, causing it to overfit and underperform in a new environment. They also experimented with different styles of reward functions, such as following the center line, an excessive steering penalty, a slowness penalty, and progress rewards, as illustrated in the sketch below.
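
For reference, a "follow the center line" reward function of the kind the participants experimented with typically follows the standard pattern from the AWS DeepRacer documentation, sketched here; the added steering penalty is one simple variation, not Vodafone's winning model.

def reward_function(params):
    # Reward staying close to the center line, with a mild penalty for excessive steering.
    track_width = params["track_width"]
    distance_from_center = params["distance_from_center"]
    steering = abs(params["steering_angle"])  # steering angle in degrees

    # Markers at increasing distances from the center line.
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    if distance_from_center <= marker_1:
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely off track

    # Penalize sharp steering to discourage zig-zagging.
    if steering > 15.0:
        reward *= 0.8

    return float(reward)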

The event culminated with a grand finale, a showdown of 11 racers who tuned their models one final time to compete in a live race with commentary. All 11 racers completed a full lap with their models. Eight racers had a lap time of less than 15 seconds, with the winner coming in with an incredible lap time of 11.194 seconds on the tricky Toronto Turnpike virtual race track.

Summary

The goal of the AWS DeepRacer Challenge was to build awareness and excitement of ML on AWS for a global cloud engineering audience with varying technology skills and competencies. The tournament exceeded 585 total registrations across the globe, with over 400 models submitted and over 600 hours of training and evaluation.

Vodafone was able to help a broad range of builders get hands-on with ML through the AWS DeepRacer challenge. With over 47% of participants being AWS and ML beginners, it reaffirms how effective AWS DeepRacer can be in introducing ML on AWS in a safe and engaging environment for beginners.

“Having the Digital Engineering team attend events like DigiFest and participate in challenges like AWS DeepRacer is a huge part of our vision of building a world-class software engineering team in Vodafone. As we take on the complex challenge of transforming a telecommunications company into a technology company, growing our skillset becomes a top priority and our partnership with Accenture and AWS has provided the team with not just this, but multiple opportunities to learn and develop. I am excited for more of this to come!”

Ben Connolly, Vodafone Global Director of Cloud Engineering


About the Author

Ramakrishna Natarajan is a Senior Partner Solutions Architect at Amazon Web Services. He is based out of London and helps AWS Partners find optimal solutions on AWS for their customers. He specialises in Telecommunications OSS/BSS and has a keen interest in evolving domains such as AI/ML, Data Analytics, Security and Modernisation. He enjoys playing squash, going on long hikes and learning new languages.

Read More

Steering at the Frontier: Extending the Power of Prompting

We’re seeing exciting capabilities of frontier foundation models, including intriguing powers of abstraction, generalization, and composition across numerous areas of knowledge and expertise. Even seasoned AI researchers have been impressed with the ability to steer the models with straightforward, zero-shot prompts. Beyond basic, out-of-the-box prompting, we’ve been exploring new prompting strategies, showcased in our Medprompt work, to evoke the powers of specialists.  

Today, we’re sharing information on Medprompt and other approaches to steering frontier models in promptbase, a collection of resources on GitHub. Our goal is to provide information and tools to engineers and customers to evoke the best performance from foundation models. We’ll start by including scripts that enable replication of our results using the prompting strategies that we present here. We’ll be adding more sophisticated general-purpose tools and information over the coming weeks.

To illustrate the capabilities of the frontier models, and the opportunities to harness and extend recent efforts to reach state-of-the-art (SoTA) results by steering GPT-4, we’ll review SoTA results on benchmarks that Google chose for evaluating Gemini Ultra. Our end-to-end exploration, prompt design, and computing of performance took just a couple of days.


Let’s focus on the well-known MMLU (Measuring Massive Multitask Language Understanding) challenge that was established as a test of general knowledge and reasoning powers of large language models. The complete MMLU benchmark contains tens of thousands of challenge problems of different forms across 57 areas from basic mathematics to United States history, law, computer science, engineering, medicine, and more.

In our Medprompt study, we focused on medical challenge problems, but found that the prompt strategy could have more general-purpose application and examined its performance on several out-of-domain benchmarks—despite the roots of the work on medical challenges. Today, we report that steering GPT-4 with a modified version of Medprompt achieves the highest score ever achieved on the complete MMLU.

In our explorations, we initially found that applying the original Medprompt to GPT-4 on the comprehensive MMLU achieved a score of 89.1%. By increasing the number of ensembled calls in Medprompt from five to 20, performance by GPT-4 on the MMLU further increased to 89.56%. To achieve a new SoTA on MMLU, we extended Medprompt to Medprompt+ by adding a simpler prompting method and formulating a policy for deriving a final answer by integrating outputs from both the base Medprompt strategy and the simple prompts. The synthesis of a final answer is guided by a control strategy governed by GPT-4 and inferred confidences of candidate answers. More details on Medprompt+ are provided in the promptbase repo. A related method for coupling complex and simple queries was harnessed by the Google Gemini team. GPT-4 steered with the modified Medprompt+ reaches a record score of 90.10%. We note that Medprompt+ relies on accessing confidence scores (logprobs) from GPT-4. These are not publicly available via the current API but will be enabled for all in the near future.
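
For readers who want a feel for the control flow described above, the following schematic sketch shows one way the pieces could fit together. It is not the actual promptbase implementation; the call_gpt4 and confidence_from_logprobs arguments are hypothetical placeholders for an API call that returns an answer together with token logprobs, and the confidence threshold is arbitrary.

import random
from collections import Counter

def medprompt_plus(question, choices, call_gpt4, confidence_from_logprobs, k_ensemble=20):
    # Base Medprompt strategy: an ensemble of chain-of-thought calls with shuffled
    # answer choices, resolved by majority vote.
    votes = []
    for _ in range(k_ensemble):
        shuffled = random.sample(choices, len(choices))
        answer, _ = call_gpt4(question, shuffled, style="chain_of_thought")
        votes.append(answer)
    cot_answer, _ = Counter(votes).most_common(1)[0]

    # Simpler prompt: a single direct query without chain of thought.
    simple_answer, logprobs = call_gpt4(question, choices, style="direct")

    # Control strategy: use the confidence inferred from logprobs to decide which
    # candidate answer to keep when the two strategies disagree.
    if simple_answer != cot_answer and confidence_from_logprobs(logprobs) > 0.9:
        return simple_answer
    return cot_answer

The full pipeline also relies on dynamic few-shot example selection and GPT-4's self-generated chain of thought, as described in the Medprompt work and the promptbase repo.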

Figure 1. Reported performance of multiple models and methods on the MMLU benchmark: Palm 2-L (5-shot) 78.4%, Claude 2 (5-shot CoT) 78.5%, Inflection-2 (5-shot) 79.6%, Google Pro (CoT@8) 79.13%, Gemini Ultra (CoT@32) 90.04%, GPT-4-1106 (5-shot) 86.4%, GPT-4-1106 (Medprompt @ 5) 89.1%, GPT-4-1106 (Medprompt @ 20) 89.56%, and GPT-4-1106 (Medprompt @ 31) 90.10%.

While systematic prompt engineering can yield maximal performance, we continue to explore the out-of-the-box performance of frontier models with simple prompts. It’s important to keep an eye on the native power of GPT-4 and how we can steer the model with zero- or few-shot prompting strategies. As demonstrated in Table 1, starting with simple prompting is useful to establish baseline performance before layering in more sophisticated and expensive methods.

Benchmark | GPT-4 Prompt | GPT-4 Results | Gemini Ultra Results
MMLU | Medprompt+ | 90.10% | 90.04%
GSM8K | Zero-shot | 95.27% | 94.4%
MATH | Zero-shot | 68.42% | 53.2%
HumanEval | Zero-shot | 87.8% | 74.4%
BIG-Bench-Hard | Few-shot + CoT* | 89.0% | 83.6%
DROP | Zero-shot + CoT | 83.7% | 82.4%
HellaSwag | 10-shot** | 95.3%** | 87.8%
* followed the norm of evaluations and used standard few-shot examples from dataset creators 
** source: Google 

Table 1: Model, strategies, and results
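
As a concrete example of the simple, out-of-the-box prompting referenced above, a zero-shot baseline can be as small as the following sketch using the OpenAI Python SDK. The model name and message format are assumptions; adapt them to your own deployment (for example, Azure OpenAI).

# Minimal zero-shot baseline: one direct question, no exemplars, temperature 0.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "Which planet in our solar system has the most moons?"
response = client.chat.completions.create(
    model="gpt-4-1106-preview",  # assumed model name
    messages=[{"role": "user", "content": f"Answer the following question.\n\n{question}"}],
    temperature=0,
)
print(response.choices[0].message.content)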

We encourage you to check out the promptbase repo on GitHub for more details about prompting techniques and tools. This area of work is evolving with much to learn and share. We’re excited about the directions and possibilities ahead.

Read More

Meet NANA, Moonshine Studio’s AI-Powered Receptionist Avatar

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically accelerate content creation.

The creative team at Moonshine Studio — an artist-focused visual effects (VFX) studio specializing in animation and motion design — was tasked with solving a problem.

At their Taiwan office, receptionists were constantly engaged in meeting and greeting guests, preventing them from completing other important administrative work. To make matters worse, the automated kiosk greeting system wasn’t working as expected.

Senior Moonshine Studio 3D artist and this week’s In the NVIDIA Studio creator Eric Chiang stepped up to the challenge. He created a realistic, interactive 3D model that would serve as the foundation of a new AI-powered virtual assistant — NANA. The avatar can welcome guests and provide basic company info, easing the strain on the receptionist team.

Chiang built NANA using GPU-accelerated features in his favorite creative apps — powered by his NVIDIA Studio-badged MSI MEG Trident X2 PC, which is equipped with a GeForce RTX 4090 graphics card.

His creative workflow was enhanced by the Tensor Cores in his GPU, which supercharged AI-specific tasks — saving him time and elevating the quality of his work. RTX and AI also improve performance in gaming, boost productivity and more.

These advanced features are supported by NVIDIA Studio Drivers — free for RTX GPU owners — which add performance and reliability. The December Studio Driver provides support for the Reallusion iClone AccuFACE plugin, GPU audio enhancements, AV1 in HandBrake and more — and is now ready for download.

Contests and Challenges Calling All Creators

Creative community The Rookies is hosting Meet Mat 3 — the 3D digital painting contest. Open to students and professionals with no more than a year of industry experience, it challenges contestants to use Adobe Substance 3D Painter to texture a blank character, MAT, in their own unique style. Prizes include GeForce RTX GPUs, Wacom Cintiq displays and more. Register today — entries close Jan. 5, 2024.

MAT, textured by artist Cino Lai in Adobe Substance 3D Painter.

And though temperatures continue to drop, the #WinterArtChallenge is heating up with un-brrrrrr-lievable entries like this extraordinary #InstantNeRF by @RadianceFields.

Be sure to include the #WinterArtChallenge hashtag for a chance to be featured on the @NVIDIAStudio, @NVIDIAOmniverse or @NVIDIAAIDev social channels.

An AI on the Future

Chiang began in Blender, sculpting intricate 3D models that served as the building blocks for NANA. Blender Cycles’ RTX-accelerated OptiX ray tracing in the viewport unlocked interactive, photorealistic modeling.

 

He then used the Marvelous Designer software for making, editing and reusing 3D clothes to create realistic clothing for NANA. This streamlined the design and simulation process, ensuring that the avatar is not only structurally sound but impeccably dressed.

NANA’s casual day outfit.

Chiang deployed Quixel Mixer and Adobe Substance 3D Painter for shading, adding depth, texture and realism to the 3D models.

 

He then used the Blender plug-in AccuRIG to efficiently create precise, adaptable character rigs.

 

Chiang put everything together in Unreal Engine, where he seamlessly integrated 3D objects into the scene, leveraging real-time rendering to create visually stunning results.

 

NVIDIA DLSS further increased viewport interactivity by using AI to upscale frames rendered at lower resolution while still retaining high-fidelity detail. All of this was powered by his GeForce RTX 4090 GPU.

The NVIDIA Studio-badged MSI MEG Trident X2 PC, equipped with a GeForce RTX 4090 graphics card.

Chiang is excited about what AI can do for creators and society at large.

“What was once science fiction is now becoming reality, opening the door to a whole new stage of scientific and technological development,” he said. “We are fortunate to participate in and witness this new stage.”

Moonshine Studio digital 3D artist Eric Chiang.

Visit Moonshine Studio and say hello to NANA.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter. 

Read More

Phi-2: The surprising power of small language models

Contributors

Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Suriya Gunasekar, Mojan Javaheripi, Piero Kauffmann, Yin Tat Lee, Yuanzhi Li, Anh Nguyen, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Michael Santacroce, Harkirat Singh Behl, Adam Tauman Kalai, Xin Wang, Rachel Ward, Philipp Witte, Cyril Zhang, Yi Zhang

Figure 1. Satya Nadella announcing Phi-2 at Microsoft Ignite 2023.

Over the past few months, our Machine Learning Foundations team at Microsoft Research has released a suite of small language models (SLMs) called “Phi” that achieve remarkable performance on a variety of benchmarks. Our first model, the 1.3 billion parameter Phi-1, achieved state-of-the-art performance on Python coding among existing SLMs (specifically on the HumanEval and MBPP benchmarks). We then extended our focus to common sense reasoning and language understanding and created a new 1.3 billion parameter model named Phi-1.5, with performance comparable to models 5x larger.

We are now releasing Phi-2, a 2.7 billion-parameter language model that demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with less than 13 billion parameters. On complex benchmarks Phi-2 matches or outperforms models up to 25x larger, thanks to new innovations in model scaling and training data curation.

With its compact size, Phi-2 is an ideal playground for researchers, including for exploration around mechanistic interpretability, safety improvements, or fine-tuning experimentation on a variety of tasks. We have made Phi-2 available on the Azure model catalog to foster research and development on language models.
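
For researchers who want to start experimenting locally, a minimal loading sketch is shown below. It assumes the weights are reachable through the Hugging Face hub under the ID microsoft/phi-2 and that the prompt follows an Instruct/Output format; both are assumptions, and the officially described distribution channel in this post is the Azure model catalog.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",          # requires the accelerate package
    trust_remote_code=True,     # may be needed depending on your transformers version
)

prompt = "Instruct: Explain why the sky appears blue.\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))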


Key Insights Behind Phi-2

The massive increase in the size of language models to hundreds of billions of parameters has unlocked a host of emerging capabilities that have redefined the landscape of natural language processing. A question remains whether such emergent abilities can be achieved at a smaller scale using strategic choices for training, e.g., data selection.

Our line of work with the Phi models aims to answer this question by training SLMs that achieve performance on par with models of much higher scale (yet still far from the frontier models). Our key insights for breaking the conventional language model scaling laws with Phi-2 are twofold:

Firstly, training data quality plays a critical role in model performance. This has been known for decades, but we take this insight to its extreme by focusing on “textbook-quality” data, following up on our prior work “Textbooks Are All You Need.” Our training data mixture contains synthetic datasets specifically created to teach the model common sense reasoning and general knowledge, including science, daily activities, and theory of mind, among others. We further augment our training corpus with carefully selected web data that is filtered based on educational value and content quality.

Secondly, we use innovative techniques to scale up, starting from our 1.3 billion parameter model, Phi-1.5, and embedding its knowledge within the 2.7 billion parameter Phi-2. This scaled knowledge transfer not only accelerates training convergence but also shows a clear boost in Phi-2 benchmark scores.

Figure 2. Comparison between Phi-2 (2.7B) and Phi-1.5 (1.3B) models on commonsense reasoning, language understanding, math, coding, and the BBH benchmark; Phi-2 outperforms Phi-1.5 in every category. All tasks are evaluated in 0-shot except for BBH and MMLU, which use 3-shot CoT and 5-shot, respectively.

Training Details

Phi-2 is a Transformer-based model with a next-word prediction objective, trained on 1.4T tokens from multiple passes on a mixture of synthetic and web datasets for NLP and coding. The training for Phi-2 took 14 days on 96 A100 GPUs. Phi-2 is a base model that has not undergone alignment through reinforcement learning from human feedback (RLHF), nor has it been instruct fine-tuned. Despite this, we observed better behavior with respect to toxicity and bias compared to existing open-source models that went through alignment (see Figure 3). This is in line with what we saw in Phi-1.5 due to our tailored data curation technique; see our previous tech report for more details on this. For more information about the Phi-2 model, please visit Azure AI | Machine Learning Studio.

A barplot comparing the safety score of Phi-1.5, Phi-2, and Llama-7B models on 13 categories of the ToxiGen benchmark. Phi-1.5 achieves the highest score on all categories, Phi-2 achieves the second-highest scores and Llama-7B achieves the lowest scores across all categories.
Figure 3. Safety scores computed on 13 demographics from ToxiGen. A subset of 6,541 sentences is selected and scored between 0 and 1 based on scaled perplexity and sentence toxicity. A higher score indicates the model is less likely to produce toxic sentences compared to benign ones.
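The precise scoring recipe is not reproduced here; the sketch below only illustrates the general idea of a perplexity-based safety probe, where a model scores higher if it assigns lower perplexity to a benign sentence than to a toxic one about the same group. The sentence pairing and the logistic normalization are assumptions made for the example, not the evaluation code behind Figure 3.

```python
# Illustrative perplexity-based safety probe (not the actual ToxiGen scoring code).
# Scores fall in (0, 1); higher means the model prefers the benign sentence.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # assumption, as in the previous sketch
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").eval()

def avg_nll(text: str) -> float:
    """Average per-token negative log-likelihood of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

def preference_score(benign: str, toxic: str) -> float:
    """Squash the NLL gap through a logistic so the score lands between 0 and 1."""
    return 1.0 / (1.0 + math.exp(avg_nll(benign) - avg_nll(toxic)))

print(preference_score(
    "Members of this community contribute to society in many ways.",
    "Members of this community cannot be trusted.",
))
```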

Phi-2 Evaluation

Below, we summarize Phi-2 performance on academic benchmarks compared to popular language models. Our benchmarks span several categories, namely, Big Bench Hard (BBH) (3-shot with CoT), commonsense reasoning (PIQA, WinoGrande, ARC easy and challenge, SIQA), language understanding (HellaSwag, OpenBookQA, MMLU (5-shot), SQuADv2 (2-shot), BoolQ), math (GSM8k (8-shot)), and coding (HumanEval, MBPP (3-shot)).

With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance than the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently announced Google Gemini Nano 2, despite being smaller in size.

Of course, we acknowledge the current challenges with model evaluation, and that many public benchmarks might leak into the training data. For our first model, Phi-1, we conducted an extensive decontamination study to rule out this possibility, which can be found in our first report, “Textbooks Are All You Need.” Ultimately, we believe that the best way to judge a language model is to test it on concrete use cases. Following that spirit, we also evaluated Phi-2 on several Microsoft internal proprietary datasets and tasks, comparing it again to Mistral and Llama-2. We observed similar trends, i.e., on average, Phi-2 outperforms Mistral-7B, and the latter outperforms the Llama-2 models (7B, 13B, and 70B).

Model     Size   BBH    Commonsense Reasoning   Language Understanding   Math   Coding
Llama-2   7B     40.0   62.2                    56.7                     16.5   21.0
Llama-2   13B    47.8   65.0                    61.9                     34.2   25.4
Llama-2   70B    66.5   69.2                    67.6                     64.1   38.3
Mistral   7B     57.2   66.4                    63.7                     46.4   39.4
Phi-2     2.7B   59.2   68.8                    62.0                     61.1   53.7
Table 1. Averaged performance on grouped benchmarks compared to popular open-source SLMs.
Model           Size   BBH    BoolQ   MBPP   MMLU
Gemini Nano 2   3.2B   42.4   79.3    27.2   55.8
Phi-2           2.7B   59.3   83.3    59.1   56.7
Table 2. Comparison between Phi-2 and Gemini Nano 2 Model on Gemini’s reported benchmarks.

In addition to these benchmarks, we also performed extensive testing on commonly used prompts from the research community. The observed behavior was in line with our expectations given the benchmark results. For example, we tested a prompt used to probe a model’s ability to solve physics problems, most recently used to evaluate the capabilities of the Gemini Ultra model, and obtained the following result:

An example prompt is given to Phi-2 which says “A skier slides down a frictionless slope of height 40m and length 80m. What's the skier’s speed at the bottom?”. Phi-2 then answers the prompt by explaining the conversion of potential energy to kinetic energy and providing the formulas to compute each one. It then proceeds to compute the correct speed using the energy formulas.
Figure 4. Phi-2’s output on a simple physics problem, which includes an approximately correct square root calculation.
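For reference, the quantity Phi-2 is reproducing follows from energy conservation: m g h = (1/2) m v^2, so v = sqrt(2gh) = sqrt(2 × 9.8 m/s² × 40 m) ≈ 28 m/s. On a frictionless slope, the 80 m length does not affect the answer.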
The model is then provided with a student’s wrong answer to the skier physics problem and asked if it can correct the student’s mistake. Phi-2 identifies the student’s mistake, i.e., using the wrong formula for potential energy, and provides the correct formula.
Figure 5. Similarly to Gemini’s test, we further queried Phi-2 with a student’s wrong answer to see whether it could identify the mistake (it did, despite Phi-2 not being fine-tuned for chat or instruction following). We note, however, that this is not a fully apples-to-apples comparison with the Gemini Ultra output described in the Gemini report; in that case the student’s answer was given as an image with handwritten text, whereas in our case it was raw text.

The post Phi-2: The surprising power of small language models appeared first on Microsoft Research.

Read More

From PyTorch Conference 2023: From Dinosaurs to Seismic Imaging with Intel

From PyTorch Conference 2023: From Dinosaurs to Seismic Imaging with Intel

Dinosaur fossil

Lightning Talk 1: Seismic Data to Subsurface Models with OpenFWI

Speaker: Benjamin Consolvo, AI Software Engineering Manager, Intel, LinkedIn

Session Overview

In this session, Ben begins with an overview of seismic imaging and full waveform inversion (FWI). Seismic imaging and FWI help us explore land for important subsurface minerals. To find those minerals, we need to image the subsurface with a high degree of accuracy at a low cost, which involves two main challenges. He explains AI-based solutions for those challenges, summarized below.

Challenge: Traditional physics-based FWI requires an accurate starting model.
AI solution: Data-driven deep learning solutions do not require an accurate starting model.

Challenge: GPUs are typically used for fine-tuning neural networks but are often unavailable and expensive.
AI solution: CPUs are highly available, inexpensive, and viable for AI fine-tuning. The 4th Gen Intel® Xeon® Scalable processor includes a built-in AI accelerator engine, Intel® AMX (Intel® Advanced Matrix Extensions), that accelerates AI training and inference performance.
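To make the CPU option concrete, the sketch below shows the typical pattern for enabling bf16 training with Intel® Extension for PyTorch so that Intel® AMX can be used on a 4th Gen Xeon® processor. The tiny model, optimizer, and random data are placeholders, not the FWI code from the talk.

```python
# Sketch of bf16 training on a 4th Gen Intel Xeon CPU with Intel Extension for
# PyTorch (ipex). The model, optimizer, and data here are placeholders.
import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Sequential(
    torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Apply CPU-specific optimizations and prepare weights for bf16 training.
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

x, y = torch.randn(32, 64), torch.randn(32, 1)
optimizer.zero_grad()
with torch.autocast("cpu", dtype=torch.bfloat16):  # bf16 matmuls can hit Intel AMX
    loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()
```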

Next, he shows the wave propagation for the subsurface model and the corresponding seismic shot gathers. In his example, the shot gathers are synthetically generated, time-sampled records of sound recordings from a shot (like a dynamite explosion or vibroseis truck) recorded by geophones spread across a large area. For this application, the training data consists of pairs of subsurface model images and seismic shot gather images, where the subsurface model is predicted from the shot gather.

Split        Seismic shot images   Subsurface model images
Train        120,000               24,000
Test         25,000                5,000
Validation   5,000                 1,000
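As a rough sketch of that pairing, a dataset class in PyTorch might look like the following. The file names and array shapes are hypothetical placeholders; adapt them to however the OpenFWI files are actually packaged.

```python
# Hypothetical sketch of the (shot gather, subsurface model) pairing described
# above. File names and array shapes are placeholders, not the OpenFWI layout.
import numpy as np
import torch
from torch.utils.data import Dataset

class SeismicPairs(Dataset):
    def __init__(self, gather_file: str, model_file: str):
        self.gathers = np.load(gather_file)  # e.g., (N, n_shots, n_time, n_receivers)
        self.models = np.load(model_file)    # e.g., (N, 1, depth, width)

    def __len__(self) -> int:
        return len(self.gathers)

    def __getitem__(self, i: int):
        return (torch.from_numpy(self.gathers[i]).float(),
                torch.from_numpy(self.models[i]).float())

# Placeholder file names for illustration only.
train_set = SeismicPairs("seismic_train.npy", "velocity_train.npy")
```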

In this application, the algorithm used during training was InversionNet (an encoder-decoder convolutional neural network). Implementation details for the InversionNet architecture can be found in Deng et al. (2021).
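For orientation, a heavily simplified network in the same encoder-decoder family is sketched below; the channel counts and shapes are placeholders, and the real InversionNet layer configuration is the one given in Deng et al. (2021).

```python
# Simplified encoder-decoder CNN in the spirit of InversionNet: shot gathers in,
# a velocity-model image out. Shapes and channel counts are illustrative only.
import torch
import torch.nn as nn

class TinyInversionNet(nn.Module):
    def __init__(self, n_shots: int = 5):
        super().__init__()
        self.encoder = nn.Sequential(  # compress the time-by-receiver seismic data
            nn.Conv2d(n_shots, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(  # expand features back into a velocity image
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, shot_gathers: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(shot_gathers))

gathers = torch.randn(8, 5, 128, 128)   # batch of 8 samples, 5 shots each
velocity = TinyInversionNet()(gathers)  # -> (8, 1, 128, 128)
```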

He then shows the results:

  1. Prediction versus ground truth model after one epoch and at 50 epochs. After training InversionNet, the predicted model is much closer to the ground truth image.
  2. Training loss and validation loss curves decreasing over time across 50 epochs.

Finally, Ben concludes his talk by highlighting that he was able to successfully fine-tune a deep neural network without an accurate starting model to obtain a subsurface model on a 4th Gen Intel® Xeon® Scalable processor.

Watch the full video recording here and download the presentation. More details can be found in this blog.

About the Speaker

Ben Consolvo

Ben Consolvo is an AI Solutions Engineering Manager at Intel. He has been building a team and a program around Intel’s AI technology paired with Intel’s hardware offerings. He brings a background in and passion for data science, particularly deep learning (DL) and computer vision. He has applied his DL skills in the cybersecurity industry to automatically identify phishing websites, and in the oil and gas industry to identify subsurface features for geophysical imaging.

Lightning Talk 2: Dinosaur Bone Hunt

Speaker: Bob Chesebrough, Sr Solution Architect, Intel, LinkedIn

Session Overview

In this session, Bob starts the presentation by explaining his interest in collecting dinosaur bones and giving an overview of the Intel AI software portfolio.

He then explains the steps to create a dinosaur site treasure map or dinosaur bone likelihood map:

  1. Collect data and create training data (aerial photos of the Morrison Formation in New Mexico, a famous dinosaur bone bed in the western United States, along with GPS coordinates of small bone fragments discovered there)
  2. Train a simple ResNet-18 model using Intel® Extension for PyTorch
  3. Score the model on Utah photos and create a heat map

Finally, Bob shows the results: dinosaur bones were discovered in Utah using the dinosaur bone likelihood map. Go to the GitHub repository to access the code sample and try it out using Intel® Extension for PyTorch.
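The repository has the complete code sample; as a hedged outline of step 2 above, fine-tuning a two-class (bone / no bone) ResNet-18 with Intel® Extension for PyTorch looks roughly like this. The folder layout, transforms, and hyperparameters are placeholders, not the repository's exact code.

```python
# Rough outline of step 2: fine-tune ResNet-18 as a bone / no-bone classifier on
# CPU with Intel Extension for PyTorch. Paths and hyperparameters are placeholders.
import torch
import intel_extension_for_pytorch as ipex
from torchvision import datasets, models, transforms

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
# Hypothetical layout: <root>/bone and <root>/no_bone subfolders of image tiles.
train_set = datasets.ImageFolder("new_mexico_tiles/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Linear(model.fc.in_features, 2)  # bone vs. no bone
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
model.train()
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

for images, labels in loader:
    optimizer.zero_grad()
    with torch.autocast("cpu", dtype=torch.bfloat16):
        loss = torch.nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()
```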

Watch the full video recording here and download the presentation. More details can be found in this blog.

About the Speaker

Bob Chesebrough

Bob Chesebrough has over three decades of industry experience in software development and AI solution engineering for Fortune 100 companies and national laboratories. He is also a hobbyist who has logged over 800 miles and 1,000 hours in the field finding dinosaur bones. He and his sons discovered an important fossil of the only known crocodilian from the Jurassic in New Mexico; they have also discovered and logged with museums more than 2,000 bone localities and described a new mass bone bed in New Mexico.

Read More

Abstracts: December 11, 2023

Abstracts: December 11, 2023

Microsoft Research Podcast: Abstracts

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Principal Researcher Alessandro Sordoni joins host Gretchen Huizinga to discuss “Joint Prompt Optimization of Stacked LLMs using Variational Inference.” In the paper, which was accepted at the 2023 Conference on Neural Information Processing Systems (NeurIPS), Sordoni and his coauthors introduce Deep Language Networks, or DLNs, an architecture that treats large language models as layers within a network and natural language prompts as each layer’s learnable parameters.

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today I’m talking to Dr. Alessandro Sordoni, a Principal Researcher from Microsoft Research. Dr. Sordoni is coauthor of a paper titled “Joint Prompt Optimization of Stacked LLMs using Variational Inference,” and this paper, which was accepted for the 2023 Conference on Neural Information Processing Systems, or NeurIPS, is available now on arXiv. Alessandro, thanks for joining us on Abstracts!


ALESSANDRO SORDONI: Hi, Gretchen, thank you for having me.

HUIZINGA: So in a few sentences, tell us about the issue or problem that your research addresses and why we should care about it.

SORDONI: So in this paper, our starting points are large language models, and to make large language models solve tasks, one of the ways that is currently used is to prompt them. By prompting that means just giving instruction to them, and hopefully by joining instruction and the input of the task, the language model can solve the task following the rules specified in the instructions. And there has been some approaches already in the literature to actually infer what that instruction is without human intervention. And in this paper, we operate in that space, which is called kind of automatic prompt engineering. And our specific problem is to, one, how to actually infer those prompts for a language model. And, two, what happens if actually the output of that large language model gets into another language model and both language model needs prompt to operate? And so basically, we give sort of an algorithm to solve that joint prompt optimization. That’s why it’s called joint.

HUIZINGA: So what’s the underlying issue there that we should care about as potential users of this technology?

SORDONI: There are some problems that cannot be just solved by kind of one instruction or rule, I would say, but they necessitate some sort of higher-level reasoning or some sort of decomposition. And in that sense, it would maybe be useful to actually have multiple calls to the LLM, where each call is modulated by a different instruction. So the first instruction could be something very general, for example, decompose or visualize the problem into a different language that is formulated in. And the second call is now recompose this visualization that you have produced to solve the problem itself. And so basically, in that context, you can think about this as kind of augmenting the computational power of the language model by splitting the one call in multiple calls.

HUIZINGA: Well, go in a little deeper on the work that this builds on. All research kind of gets a prompt—no pun intended—from previous work. So how does your work build on and/or differ from what’s been done previously in this field?

SORDONI: I would say that our work started more with this intuition that LLMs are just kind of black-box computation units. Now this sort of black box can accept input as input language. The computation is modulated by an instruction and it outputs language, so you can stack these layers, right. So if the weights of this language layer now are the instructions and you can stack them together, how can you optimize them, right? And then we start to think, OK, but this is very related to kind of automatic prompt optimization. The overall kind of prompt engineering and prompt optimization approaches right now work by proposing some prompts and accepting some prompts. So we did some modifications with respect to how we propose new prompts to language models and how do we evaluate and accept then those that work given some task inputs and outputs. Our goal in the future—I would say in the near future—is going to be to basically integrate optimization that can really express arbitrary graphs …

HUIZINGA: Gotcha …

SORDONI: … of LLM calls right now. But in our paper, we started with the first step, which is, OK, say that I just have two calls. Can I just optimize prompts for that very simple graph? And we proposed an algorithm to do so. So basically, I guess our main contribution is, one, getting a better prompt optimizer for one layer and also devising an algorithm that works for two layers right now and that can be extended to multiple layers. But that’s also an engineering problem that needs to be tackled.

HUIZINGA: [LAUGHS] Yeah, always got to get the engineering in there! Well, listen, let’s keep going on this because it sounds like you’re talking about methodology and, and how you conducted this research. So expand a little bit on what you did actually to experiment in this arena.

SORDONI: Yeah, so I think that, uh, really the birth of this paper started from this kind of view of these language models as layers modulated by instructions that can be stacked upon each other. And from there, we said, OK, what can we do with this, basically? And so some of us worked on datasets that could be interesting for this new sort of methodology, I would say, or architecture. So basically, one question was, how do you go forward to actually test if this works in any way? And so we tried to select some datasets that were more of natural language tasks—for example, sentiment classification—and some datasets that were more about reasoning tasks. And our hunch was that basically stacking multiple layers together would help more in those tasks that would require some sort of decomposition of reasoning.

HUIZINGA: Right.

SORDONI: And for the reasoning task, we worked with this BIG-Bench Hard setting. And so parallel to that, there were some of us that worked, for example myself, in the optimization part, really in the algorithm part. And at first, we tried to do some sort of back propagation. But I quickly saw that there were some sort of issues with that … probably empirically issues. And so we tried to actually have a more formal understanding of this optimization algorithm by recurring to variational inference basically, so basically, to understand actually the first layer as producing some text and considering this text as a latent variable. When you open that box, it links also in your head to all … a bunch of kind of related works in the literature that have studied this problem very, very thoroughly. And so you can use those techniques into this context.

HUIZINGA: Interesting. So what were the results of this research? What did you find?

SORDONI: So what we found was that, indeed, the tasks in which these approaches seem to help the most are the tasks that require this sort of decomposition and reasoning. The first thing that was really, really kind of cool, it was that kind of you can go a long way in improving the performance of these language models by accurate prompt optimization. Because in some models, prompt optimization can be understood as kind of really tweaking the models towards solving the task. But in some other tasks, actually, when humans write prompts, they tend to maybe underspecify the prompt or tend to basically be not very clear to how to instruct the model. So the model has to do a lot of work to understand …

HUIZINGA: Right …

SORDONI: … what the human really wants to say to them. And so basically, this sort of prompt optimization acts as a sort of translator where it formulates a prompt that more comprehensively describes the task and more comprehensively contains some rules to solve the task. So it was very interesting to me, that kind of level of abstraction that was sort of required and needed in the prompt to really solve this task very, very well. The other finding is that this problem is very hard. It’s very tricky to optimize, to prompt, this type of optimization because this type of optimization doesn’t really follow a gradient direction like in deep neural networks.

HUIZINGA: Yeah.

SORDONI: It’s basically a sort of trial and error. And this trial and error is very finicky. There’s a lot of problems there. But I feel like I’m hopeful in the sense that this paper allowed us, I think, to hone in some very specific problem that if we solve them, we can make the problem much easier.

HUIZINGA: Let’s talk for a second about real-world impact of this research. Let’s extrapolate out from the lab and move into life. Who benefits from this most, and how do they benefit?

SORDONI: I think that, as I said before, like these automatic prompt optimization methods could benefit, I think, a large audience, or large amount of users, I would say, because they could be understood as a sort of translator between the user needs and what the LLM can do. For example, one effort here in Montréal that was led by my colleagues was kind of building this sort of interactive agent that would, through interaction with the user, form a prompt but interactively. So, for example, in DLN, like in our paper, we assume that we have a task and we do not have input or interaction with the user, right. But in more realistic scenarios, you might want to actually instruct your model to do something by some sort of active learning process where the model actually propose you whether what it did was favorable or desirable or not.

HUIZINGA: Right.

SORDONI: And the user can actually interact with that output, right. For the multilayer case, my hope is that that would be useful to build and optimize these large sort of graphs of LLM calls.

HUIZINGA: I want to take a second here to spell out some acronyms. You’ve referred to DLNs, and I don’t think our audience might know what that means. I’m assuming they know LLM means “large language model.” That’s sort of in the parlance. But talk a little bit about what that other acronym is.

SORDONI: Yeah, sorry I didn’t mention this. So DLN is basically how we refer to these architectures that are composed of language model layers. So DLN is, spells as “Deep Language Network.”

HUIZINGA: Gotcha.

SORDONI: People are free to use this name or not.

HUIZINGA: No, I like it …

SORDONI: I’m not a big fan of imposing acronyms on the world [LAUGHS], but that’s a, that’s a shorter version of it. So, yeah, so it’s really the idea that a language model is a layer in this hierarchy, and the layer accepts as input a text, it outputs a text, and really is modulated by an instruction or prompt that we want to learn.

HUIZINGA: And so the DLN is a deep language network and it sort of acts as a deep neural network but using language models as your layer.

SORDONI: Exactly, exactly, yes.

HUIZINGA: So this is a question I ask everyone, and it’s sort of like, how could you boil this down to one little takeaway if you’re standing on an elevator with somebody and they say, what do you do, Alessandro? So if there’s one thing you want people to take away from this work, what would it be?

SORDONI: The first thing that came to my mind is really the fact that these models can be understood really as a class, I would say, of probability distributions and that they are modulated by these prompts. And so basically, once you have that, once a language model just defines a (p) over sentences given some prompt, you can apply a lot of algorithms with those models. You can apply algorithms that resembles to EM, expectation maximization, or … I mean, we applied a form of that with variational inference, but maybe kind of it could open the path for other types of usages where kind of these are just very, very powerful probability distributions over these sentences that are considered as latent variable. I hope that our paper can show like a more or less practical kind of implementation of that idea. And that basically if you have to optimize, for example, prompts with one or two layers, you can definitely try our approach.

HUIZINGA: Well, finally, and we’ve been talking about this kind of already, but there seem to be some unresolved problems in the area. What do researchers like you need to be looking at in order to solve those? Sort of what’s next on the research agenda, whether it’s you or other researchers in this field?

SORDONI: So let me try to answer by something that really excites me now. What we are doing is that we are producing text, right. With the language model. But we are producing this text in such a way that it helps to solve a problem. And basically, this variational inference method and kind of framework gives us a way of understanding what does it mean to be a good text? Like what does it mean to be a good kind of latent variable or useful latent variable?

HUIZINGA: Right.

SORDONI: What does it mean to produce good data? So, for example, these big models kind of are really data creators, like this generative AI, right. But can we actually teach them to produce data such that this data can be helpful to solve tasks or to condition those same models to solve a task?

HUIZINGA: Right.

SORDONI: And what are the objective functions that promote the production of this useful data? What useful means from a mathematical perspective. I think that, apart from the prompt optimization angle, I feel like DLN to me kind of opened a little bit my spirit into kind of investigating ways of understanding what does it mean for some generated text to be useful to solve a task, I would say. Yeah.

HUIZINGA: Alessandro Sordoni, thanks for joining us today. And thanks to our listeners for tuning in. If you’re interested in learning more about this work, you can find a link to the paper at aka.ms/abstracts or you can find it on arXiv. See you next time on Abstracts!

The post Abstracts: December 11, 2023 appeared first on Microsoft Research.

Read More